Data scientists still face many challenges in managing and aggregating data so that it becomes useful. These challenges can be overcome by learning from fellow data scientists who have completed a data analytics project, pivoted away from the original idea along the way, and ultimately delivered positive results to their organization.
We wanted to understand the key lessons that data scientists in large companies have learned while working on their big data analytics projects. We therefore asked AIM Expert Network (AEN) members to share an insight they have recently gained. In this article, AEN members describe what they set out to achieve with big data analytics, the point at which they realized a pivot was required, and the key lesson they took away from the process.
This article should help fellow data scientists avoid common mistakes when executing data analysis operations for their organizations.
Converting Data Pipelines Built on Traditional Databases to a Big Data Platform
Initial Plan: A few years back, when we started transitioning our data pipelines as part of an enterprise-wide “BI Modernization using Big Data” project, we believed that moving the code (stored procedures, macros, and SQL in Teradata) to the big data tool Apache Hive would be:
- Mostly lift and shift
- Only about 10% requiring code refactoring
Our assumption rested on Hive being SQL-2 compliant and Teradata being SQL-3 compliant.
Pivot point: Once we entered the build phase of the project, we realized that Hive queries needed substantial tuning in multiple places, e.g. leveraging SMB (Sort Merge Bucket) joins. Converting our slowly changing dimension (SCD Type 1 and Type 2) pipelines alone required more than 40% code refactoring, because Hive did not support updates.
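Because Hive (at the time) lacked an UPDATE statement, SCD pipelines had to be reworked as full-table rewrites rather than in-place updates. A minimal pandas sketch of that SCD Type 2 rewrite pattern (the table and column names are illustrative, not from the project):

```python
import pandas as pd

def scd2_rewrite(dim: pd.DataFrame, updates: pd.DataFrame) -> pd.DataFrame:
    """Apply SCD Type 2 changes by rebuilding the dimension table,
    mirroring the overwrite-style pattern needed when the engine
    (e.g. older Hive) has no UPDATE statement."""
    dim = dim.copy()
    # Close out the current version of every changed row.
    changed = dim["key"].isin(updates["key"]) & dim["is_current"]
    dim.loc[changed, "is_current"] = False
    # Append the new versions as the current rows.
    new_rows = updates.assign(is_current=True)
    return pd.concat([dim, new_rows], ignore_index=True)

dim = pd.DataFrame({"key": [1, 2], "city": ["Pune", "Delhi"],
                    "is_current": [True, True]})
updates = pd.DataFrame({"key": [2], "city": ["Mumbai"]})
result = scd2_rewrite(dim, updates)
```

The whole table is reconstructed and written back, which is why the refactoring effort was far larger than a simple lift and shift.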
Lesson Learnt: With the explosion of data, many database tools have entered the market, and SQL remains the language of choice for data analytics projects. When approaching the office of the CIO or the enterprise IT head, database vendors lead with ANSI SQL compliance, since the standard reassures organizations that adopting a new database technology will require minimal reskilling of employees. However, enterprises must read the fine print of exactly which features and functions a database tool supports, and only then make an informed decision on whether migrating to it makes business sense.
Ranjan Relan, Data Strategy and Tech Consultant – ZS Associates
Avoiding Data Swamps
There is little doubt that big data empowers us. However, as the classic saying goes, with great power comes great responsibility: the boon of big data can easily turn into a bane if mishandled.
One of the classic problems we encountered early in our big data projects was the data swamp: an uncontrolled state of a data lake. Because ingesting data into a lake is so easy, you can quickly lose control over it, rendering the lake worthless.
To overcome this, we adopted strict governance and security practices, effectively transforming the data lake into a data hub.
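One lightweight governance control is to refuse ingestion of any dataset that arrives without basic ownership and schema metadata. A minimal, purely illustrative sketch (the catalog structure and required fields are assumptions, not the author's actual controls):

```python
# Minimal data-lake catalog: every dataset must register an owner,
# a schema, and an ingestion date before it is accepted.
REQUIRED_METADATA = {"owner", "schema", "ingested_on"}

catalog = {}  # dataset name -> metadata dict

def ingest(name: str, metadata: dict) -> bool:
    """Accept a dataset only if its governance metadata is complete."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        return False  # reject: undocumented data is how swamps start
    catalog[name] = metadata
    return True

ok = ingest("orders_raw", {"owner": "sales-eng",
                           "schema": "id:int,amount:float",
                           "ingested_on": "2020-06-01"})
bad = ingest("mystery_dump", {"owner": "unknown-team"})
```

The same gatekeeping idea scales up to real catalog tools; the point is that nothing lands in the lake anonymously.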
Prasad Kulkarni, Senior Software Engineer.
Implementing the Apache Kafka system for data streaming
Our goal was to analyze tweets about the iPhone during its launch event, to gauge the initial sentiment around it. Initially, we decided to pull the tweets using Python packages and pass them to scripts that predict sentiment scores.
We soon realized the plan needed revision: streaming data tools could bring in live data and send it over to the system for score prediction. We implemented a Kafka system, which is built on the producer-consumer concept.
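Kafka itself requires a running broker, but the producer-consumer pattern it implements can be sketched broker-free with Python's standard library (the tweets and the toy scoring rule below are made up for illustration):

```python
import queue
import threading

stream = queue.Queue()  # stands in for a Kafka topic
scores = []

def producer(tweets):
    """Publish each incoming tweet to the topic, then signal completion."""
    for tweet in tweets:
        stream.put(tweet)
    stream.put(None)  # sentinel: no more messages

def consumer():
    """Read tweets off the topic and compute a toy sentiment score."""
    while True:
        tweet = stream.get()
        if tweet is None:
            break
        text = tweet.lower()
        score = 1 if "love" in text else -1 if "hate" in text else 0
        scores.append((tweet, score))

tweets = ["Love the new iPhone!", "I hate the price", "Launch event today"]
t = threading.Thread(target=consumer)
t.start()
producer(tweets)
t.join()
```

In the real system the queue is a Kafka topic and the consumer feeds a proper sentiment model, but the decoupling of ingestion from scoring is the same.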
- Plan for a robust system rather than a one-off attempt.
- Be ready for multilingual data (we had a lot of Chinese tweet data).
- As the data grows, ordinary single-machine computing becomes difficult.
- In-memory tools such as Spark can help run computations at a faster pace.
Vijayakeerthi Jayakumar, Data Scientist – Cognizant Technology Solutions
Aiming for scale and speed with your big data project
While working on a big data analytics project, our main aim was scale and speed for the data. The goal of the project was to build a real-time dashboard for business users to track the inflow of orders being processed. Data integration was the biggest challenge, as the data was coming from various sources. The data was formatted and structured, and insights were drawn from descriptive analytics run on it. I personally prefer SQL over NoSQL, so we used Google BigQuery to store the data. During this project, we realized that big data platforms are the need of the hour to accommodate massively increasing data, but different databases are meant for different kinds of data and should be selected precisely in line with the type of data you are dealing with.
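The descriptive-analytics layer behind such a dashboard is essentially SQL aggregation over the integrated order data. A minimal stand-in using SQLite (the table and columns are illustrative; the project itself used Google BigQuery):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (source TEXT, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("web", "processed", 120.0), ("web", "pending", 80.0),
     ("store", "processed", 200.0), ("partner", "processed", 50.0)],
)
# Per-source inflow of processed orders: the kind of rollup a
# real-time dashboard would repeatedly refresh from the warehouse.
rows = conn.execute(
    """SELECT source, COUNT(*) AS n, SUM(amount) AS total
       FROM orders
       WHERE status = 'processed'
       GROUP BY source
       ORDER BY source"""
).fetchall()
```

The same GROUP BY query runs unchanged on BigQuery; what changes at scale is the engine underneath, which is exactly why matching the database to the data matters.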
Netali Agrawal, Technology lead – Infosys.
Prioritizing use cases in order to build tangible value from your data
We wanted to start with a small MVP or pilot, with one or two specific use cases in mind. We initially planned to build the MVP over 6 to 8 weeks, followed by sprint-based implementation phases of 8 to 12 weeks each. The implementation phase could be further divided into implementation (covering 3-4 functional areas) and industrialisation (rolling the solution out across markets).
- Document learnings from the Pilot/MVP to be considered in the larger implementation phase.
- Use Agile delivery sprints to deliver incremental changes which the business can continuously consume (every 2-3 months), keeping the participation and feedback loops alive.
- For any Big Data implementation, use cases need to be prioritized and fitted into the implementation plan for the solution to address the business pain points and bring tangible value.
- Revisit the business case after implementation to see the benefits gained and adjust the original business case accordingly. This is vital to building trust with the business consumers of data and analytics services.
- Ensure data ingested into the platform has suitable data security and governance controls to ensure the quality of insights.
Saumya Chaki, Data Platforms Solutioning Leader – IBM
Using Feature Engineering (FE) to improve prediction power
“Our company decided to run a big data project to find the two main business features that could help us reduce customer churn by at least 25 percent. We created a team of experts from our Business Intelligence (BI) and data reporting departments. Upon starting, we identified that billing and sales data had to be the main base for this analysis. Through exploratory data analysis (EDA), we recognized a set of features that showed a correlation with churn. We picked the high-correlation features and built various models on the training data. During testing, all the models failed miserably on the test data. We revisited all the features and applied feature engineering techniques to derive services-used group features, which improved the predictions significantly. The key lesson I learned from this process was to go beyond the obvious features and perform feature engineering to find more impactful, abstract business attributes that improve the model’s predictive power.”
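Deriving grouped features, rather than feeding raw correlated columns to the model, is the crux of that lesson. A small, purely illustrative sketch (the service names and grouping rule are assumptions, not Colt's actual features):

```python
# Raw per-customer service usage, as it might come from billing data.
customers = [
    {"id": 1, "voice_min": 120, "data_gb": 0.0, "vpn_hours": 0},
    {"id": 2, "voice_min": 10,  "data_gb": 55.2, "vpn_hours": 40},
    {"id": 3, "voice_min": 300, "data_gb": 12.0, "vpn_hours": 5},
]

def engineer_features(c: dict) -> dict:
    """Derive 'services-used' group features from raw usage columns."""
    services_used = sum(
        1 for k in ("voice_min", "data_gb", "vpn_hours") if c[k] > 0
    )
    return {
        "id": c["id"],
        "services_used": services_used,          # breadth of engagement
        "is_multi_service": services_used >= 2,  # abstract group feature
    }

features = [engineer_features(c) for c in customers]
```

Abstract features like breadth of service usage often carry churn signal that no single raw column shows, which is why the reworked models performed better.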
Anil Sharma, Senior IT Architect – Colt Technology Services.