Each day we generate 2.5 quintillion bytes of data. Most of the data we generate while using the internet is raw and cannot be used directly by an organization to make data-driven decisions. We wanted to understand how data scientists have evolved in 2020 and which tools they now use to tackle the challenges posed by these new and varied forms of data.
Therefore, we asked AIM Expert Network (AEN) members to share their insights on the challenges they face while processing different types of raw data and how they convert that data into valuable assets for their organizations.
1. Getting data from multiple sources
While building the anomaly detection system for our app, we faced an issue with the huge amount of data coming from different sources and databases. The biggest challenge was to take all the forms of data the app generated and bring them into one single format so observation could be centralized. Querying the production database in real time was also not an option.
The action required here was to bring all this disparate data together in one database. For this we used Google BigQuery, a serverless data warehouse, to store the data. The refresh cycle ran twice daily, pulling data from our database and files and loading it into BigQuery. After that, we worked on processing the data further.
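The consolidation step described above can be sketched as mapping each source's records onto one common schema before loading. This is a minimal illustration with hypothetical field names and event shapes (the source does not describe the actual schemas); in production the normalized rows would then be loaded into BigQuery, for example with the `google-cloud-bigquery` client, on each refresh cycle.

```python
from datetime import datetime, timezone

# Hypothetical raw events from two different sources with different shapes.
app_event = {"uid": 42, "event": "login", "ts": 1700000000}
web_log = {"user_id": "42", "action": "login", "timestamp": "2023-11-14T22:13:20Z"}

def normalize_app(e):
    """Map an app event onto the common schema."""
    return {
        "user_id": str(e["uid"]),
        "event": e["event"],
        "event_time": datetime.fromtimestamp(e["ts"], tz=timezone.utc).isoformat(),
        "source": "app",
    }

def normalize_web(e):
    """Map a web log line onto the same common schema."""
    return {
        "user_id": e["user_id"],
        "event": e["action"],
        "event_time": e["timestamp"].replace("Z", "+00:00"),
        "source": "web",
    }

# Both sources now share one schema and can land in one warehouse table.
rows = [normalize_app(app_event), normalize_web(web_log)]
```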
- Netali Agrawal, Technology Lead – Infosys
2. Unlocking value out of Unstructured Text Data
A major chunk of the data stored by enterprises around the world is unstructured text. Traditionally, analysts have spent an enormous amount of time, effort and resources transforming unstructured text data into a standardized format to extract insights from it. Overall, results have varied for lack of the right technology; more often than not, only low-value insights were derived, with the cost outweighing the benefits.
Solution: Ontologies and Graph Databases
Recently, enterprises have realized the impact of using ontologies, whose growing adoption has reduced the data-processing burden on data engineers. Ontologies define a common vocabulary and enable smooth knowledge management. Also, with the increased maturity and awareness of graph databases (such as Neo4j, AWS Neptune, etc.), which are used for knowledge management and for finding connections in text data, organizations are able to unlock value from unstructured data.
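The core idea behind graph databases, finding connections between entities extracted from text, can be shown with a toy in-memory triple store. The entities and relation names below are invented for illustration; a real deployment would store the triples in Neo4j (queried with Cypher) or AWS Neptune rather than a Python list.

```python
# Toy knowledge graph: (subject, relation, object) triples as might be
# extracted from unstructured text. Purely illustrative data.
triples = [
    ("Acme Corp", "ACQUIRED", "Widget Inc"),
    ("Widget Inc", "MAKES", "widgets"),
    ("Acme Corp", "HEADQUARTERED_IN", "Berlin"),
]

def neighbors(entity, relation=None):
    """Return objects connected to `entity`, optionally filtered by relation."""
    return [o for s, r, o in triples
            if s == entity and (relation is None or r == relation)]

# "What did Acme Corp acquire?" -- a one-hop traversal over ACQUIRED edges,
# the kind of query a graph database answers natively at scale.
print(neighbors("Acme Corp", "ACQUIRED"))
```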
- Ranjan Relan, Data Strategy and Tech Consultant – ZS Associates
3. Setting up the infrastructure and velocity of data
The primary challenge in handling modern data requirements (especially streaming) is setting up the infrastructure, owing to the high volume and velocity of the data. This can be handled efficiently with cloud streaming services such as those on Microsoft Azure. Two PaaS services stand out here: Azure Stream Analytics and Azure Databricks.
The former is a first-party streaming service that works well with messaging services like Azure IoT Hub or Event Hub; the article ‘An Introduction to Azure IoT with Machine Learning’ elaborates on this. The latter, Azure Databricks, is a unified analytics platform that can implement the Lambda Architecture. Details can be found here: Lambda Architecture with Azure Databricks.
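A basic operation both of those services perform on high-velocity streams is windowed aggregation. The plain-Python simulation below shows the idea with a tumbling window over an invented event stream; in production the same aggregation would be expressed declaratively in Stream Analytics SQL or as a Databricks Structured Streaming job reading from Event Hub.

```python
from collections import defaultdict

# Simulated stream of (epoch_seconds, value) events. Invented data --
# in production these would arrive continuously from IoT Hub / Event Hub.
events = [(0, 10), (3, 20), (7, 5), (12, 8), (14, 2)]

WINDOW = 5  # tumbling-window length in seconds

def tumbling_sums(stream, window):
    """Sum values per non-overlapping window, keyed by window start time."""
    sums = defaultdict(int)
    for ts, value in stream:
        sums[ts // window * window] += value  # bucket by window start
    return dict(sums)

print(tumbling_sums(events, WINDOW))
```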
- Prasad Kulkarni, Senior Software Engineer – Nuance Communications
4. Adapting to different tools to collect unstructured data
The biggest challenge now, and going forward, in data processing is the change in the type of data coming in. Previously, most data was structured; now a lot of data arrives in unstructured formats from numerous sources such as social media platforms, emails and shared cloud storage platforms. Analyzing, processing and storing this data is a challenge organizations are still grappling with today.
In my opinion, the first thing any organization looking to become more data-driven needs to do is revisit its data strategy, including its data collection mechanisms, data entry points and the tools used for data processing and integration. Data teams need to appreciate that no single tool can solve all their challenges, so they should be open to adopting new tools and processes for data management. The fundamentals of data management remain the same: having a robust data model (virtual, federated or physical), having a way to build trust in the data (with good, robust data quality processes in place) and, most importantly, having a way for the larger group to know what data is available and what it means to the business (a metadata strategy).
Since data is useful and valuable only when it is available and accessible to all, we should build our models so that savvy users can take them and add their own data using front-end tools like Tableau and Qlik. Where there are too many sources of data, tools like Alteryx can enable self-service ETL for users in a controlled environment. With the advent of the cloud, the storage problem has largely been solved: data architects can create an “elastic” warehouse or lake in the cloud without worrying about storage space limitations.
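The unstructured-to-structured step this section describes often amounts to pulling a few reliable fields out of free text so they can join the rest of the warehouse. The sketch below uses invented support emails and field names purely to illustrate that pattern; real pipelines would combine rules like these with NLP models and proper ETL tooling.

```python
import re

# Hypothetical support emails: free text with a loosely consistent pattern.
emails = [
    "Order #1234 arrived damaged. Contact: alice@example.com",
    "Still waiting on order #987. Contact: bob@example.com",
]

ORDER = re.compile(r"[Oo]rder #(\d+)")
EMAIL = re.compile(r"Contact:\s*(\S+@\S+)")

def to_record(text):
    """Extract the structured fields we care about from free text."""
    order = ORDER.search(text)
    contact = EMAIL.search(text)
    return {
        "order_id": int(order.group(1)) if order else None,
        "contact": contact.group(1) if contact else None,
    }

# Once structured, these rows can flow through the same quality checks
# and metadata catalog as any other dataset.
records = [to_record(e) for e in emails]
```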
- Amit Agarwal, Senior Manager (IT) – Nvidia Graphics
5. Building a robust strategy before collecting data
Life doesn’t start when a child is born; it starts when life forms inside the mother’s womb. Similarly, a data analysis project doesn’t start when you get the data; it starts when you start collecting the data. Only once we understand this will we know which analyses are and are not possible with our data.
Suppose we have some known analysis goals and the data is to be collected through a survey. We should design the questionnaire so that everything we need is captured precisely. This will reduce disappointments during analysis. And since we will already know which required attributes were not collected for certain analyses, we won’t waste time attempting them. This ultimately saves time and regret.
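That planning step can even be made mechanical: before fielding the survey, check each planned analysis against the attributes the questionnaire actually captures. The analysis names and field names below are hypothetical, invented only to show the check.

```python
# Planned analyses and the attributes each one needs (hypothetical names).
analyses = {
    "churn_by_age": {"age", "is_active"},
    "income_vs_satisfaction": {"income", "satisfaction_score"},
}

# Attributes the draft questionnaire actually captures.
questionnaire_fields = {"age", "is_active", "satisfaction_score"}

def missing_attributes(analyses, fields):
    """Which planned analyses cannot run with the fields being collected?"""
    return {name: needed - fields
            for name, needed in analyses.items()
            if not needed <= fields}

# Flags the gap before any data is collected, while it is still cheap to fix.
print(missing_attributes(analyses, questionnaire_fields))
```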
- Prakash B Pimpale, Principal Technical Officer – CDAC Mumbai
6. Understanding the quality based on the semantics of data
Enterprises see huge opportunities in big data analytics by integrating data from both internal and external sources, including structured, semi-structured (weblogs) and unstructured data (social media feeds).
New use cases around Sentiment Analysis and Customer Feedback arise from the processing of unstructured data from call logs or chatbots. This can provide vital insights into why a customer is not happy about the services provided or a product and can be used to improve service quality and enhance customer satisfaction.
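As a minimal sketch of the sentiment use case above: the simplest approach scores a transcript against small word lists. The lexicon here is invented and tiny; production systems would use a trained sentiment model over the full call logs or chatbot transcripts rather than word counting.

```python
# Minimal lexicon-based sentiment scorer for chat/call transcripts.
# Illustrative only -- real systems use trained models.
POSITIVE = {"great", "happy", "resolved", "thanks"}
NEGATIVE = {"broken", "unhappy", "slow", "refund"}

def sentiment(text):
    """Classify text by counting positive vs negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("the app is broken and support was slow"))  # negative
print(sentiment("issue resolved thanks"))                   # positive
```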
Data security and privacy are key in countries with data protection regulations like the GDPR in the EU. Enterprises must process data keeping design principles like secure-by-design in mind. The principles of data anonymization and minimization are key here to address security and privacy concerns.
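The two principles named above can be sketched together: minimization drops fields the analysis does not need, and the direct identifier is replaced with a salted hash. The record and field names are invented for illustration, and note the caveat in the comment: hashing is pseudonymization, not full anonymization, in GDPR terms.

```python
import hashlib

# Hypothetical customer record containing direct identifiers.
RAW = {"name": "Alice", "email": "alice@example.com",
       "country": "DE", "purchase_total": 120.0}

# Data minimization: the downstream analysis only needs these fields.
KEEP = {"country", "purchase_total"}

def pseudonymize(record, salt="s3cret"):
    """Drop unneeded PII and replace the direct identifier with a salted hash.
    (Hashing is pseudonymization, not full anonymization, under the GDPR.)"""
    out = {k: v for k, v in record.items() if k in KEEP}
    out["user_key"] = hashlib.sha256((salt + record["email"]).encode()).hexdigest()[:12]
    return out

# `row` keeps its analytical value but no longer carries name or email.
row = pseudonymize(RAW)
```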
For enterprises managing large, complex data sets, business and technical metadata across the diverse data domains is crucial to capturing the semantics of the data being collected, processed and analyzed. The quality of insights depends on understanding those semantics.
- Saumya Chaki, Data Platforms Solutioning Leader – IBM
7. Building a strong Data Foundation
We all work in different domains with a variety of sources, and the biggest challenge is to have a holistic, centralized view. This needs to be viewed as more than a storage problem: it is about building a complete RDM- and MDM-powered single source of truth. Building such a strong data foundation involves mammoth effort, but the benefit is worth it in the long run.
A broader solution lies in creating an enterprise-wide initiative to form a data strategy, along with specialized teams in areas like data governance, data quality, RDM, MDM, data stewardship, data integration and data processing. There are plenty of tools on the market for these different areas, and having SysOps and DataOps teams in place will help in provisioning these services.
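The heart of the MDM work mentioned above is the "golden record": reconciling duplicate records from different systems into one trusted row. The toy merge below, with invented CRM and ERP records and a deliberately simple survivorship rule (newest non-null value wins), only illustrates the idea; real MDM tools apply far richer matching and survivorship logic.

```python
# Hypothetical duplicate customer records from two systems.
crm = {"id": "C1", "name": "Jane Doe", "phone": None, "updated": 2}
erp = {"id": "C1", "name": "J. Doe", "phone": "+49-30-1234", "updated": 1}

def golden_record(*records):
    """Per field, keep the non-null value from the most recently updated record."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated"]):  # oldest first
        for field, value in rec.items():
            if value is not None:
                merged[field] = value  # newer records overwrite older ones
    return merged

# The single source of truth combines the freshest name with the only
# known phone number.
print(golden_record(crm, erp))
```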
In a nutshell, challenges in data processing will continue to grow until we have a good and robust data foundation in place.
- Ravichander Rajendran, Data Analysis & AI Engineering Lead – AstraZeneca