Every machine learning lifecycle starts with a business problem. It can be about increasing sales, making predictions, cutting costs, or whatever else brings profit to the organisation. Once a business problem is classified as an artificial intelligence or machine learning problem, the next question is: what data will be used to solve it?
Answering this question leads to building pipelines in which the data is processed, exploratory data analysis is carried out, and feature engineering is done. To meet growing customer demands, companies that use data platforms need the scale, agility, and flexibility to combine different types of data and analytics approaches, so that they can turn data into a valuable corporate asset.
Traditional data platforms struggle to keep up as business landscapes evolve and data source schemas change. The quality of data, too, varies, and traditional data platforms are falling short of providing the desired insights.
Combining and exploiting data for better insights is a challenge that requires novel strategies, focused on delivering incremental value and on giving an organisation capabilities that are holistic and future-proof.
What’s Wrong With Traditional Data Ecosystems
Traditional data ecosystems usually consist of a staging layer, an operational data store, an enterprise data warehouse, and a data mart layer along with Big Data technologies.
However, using data warehouses with relational or appliance technologies for traditional structured data poses challenges to real-time decision making. This approach is also not ideal for integrating unstructured data coming from varied sources such as social media sites, weblogs, and smart ATMs.
Here is an illustration of the change in trends of data-driven solutions in the past (left) and present day (right).
Consider present-day financial institutions alone: the amount of critical data they generate is enormous. Data from 24×7 call centers, video chats with advisors and specialists, and smart ATMs, together with location-based data, eventually adds up to vast amounts of complex, unstructured text, audio, and video data.
Handling such a surge in data requires advanced, or at least augmented, data strategies and up-to-date ETL tools in the arsenal.
What Does It Take To Build One
Data ingestion has garnered much attention over the years. Attempts are being made to integrate multiple data-generating sources to draw more reliable insights. However, with vast amounts of data, drawing insights in real time has become tedious.
To this end, whether the problem is a large amount of unlabeled data or a lack of data, there has been some novel work at the algorithmic level in which neural networks were able to predict accurately using unlabeled data.
The catch with most ML models is the time taken to train them, and traditional data platforms are just as prone to the hassles that training brings. Currently, there is a trend of deploying pre-trained models using open-source frameworks and libraries.
A next-generation platform not only has efficient data ingestion techniques but is also flexible enough to incorporate pre-trained models into its workflow for faster results. Companies like LinkedIn and Uber have already been building in-house data management platforms suited to the challenges that usually come with a rapidly growing user base.
Another solution to traditional data ingestion challenges is to adopt a metadata-driven approach to building ingestion pipelines. This is done by using a configurable set of attributes to define the common characteristics of data that determine ETL behavior across pipelines.
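To make the idea of metadata-driven ingestion concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `SourceConfig` attributes, the `ingest` function, and the sample sources (`crm_export`, `atm_events`) are hypothetical names invented for this example, not part of any particular platform. The point is that one generic routine handles every source, and only the metadata changes.

```python
import csv
import io
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical metadata record describing one data source."""
    name: str
    fmt: str                    # "csv" or "jsonl"
    required_fields: List[str]  # records missing any of these are dropped
    rename: Dict[str, str] = field(default_factory=dict)  # source column -> target column


def read_csv(text: str) -> Iterable[Dict[str, Any]]:
    # The first line is treated as the header row.
    return csv.DictReader(io.StringIO(text))


def read_jsonl(text: str) -> Iterable[Dict[str, Any]]:
    # One JSON object per non-empty line.
    return (json.loads(line) for line in text.splitlines() if line.strip())


# The format named in the metadata selects the reader.
READERS: Dict[str, Callable[[str], Iterable[Dict[str, Any]]]] = {
    "csv": read_csv,
    "jsonl": read_jsonl,
}


def ingest(config: SourceConfig, raw_text: str) -> List[Dict[str, Any]]:
    """Generic ingestion step whose behavior is driven entirely by `config`."""
    rows = []
    for record in READERS[config.fmt](raw_text):
        # Apply the column renames declared in the metadata.
        renamed = {config.rename.get(k, k): v for k, v in record.items()}
        # Keep only records where every required field is present and non-empty.
        if all(renamed.get(f) not in (None, "") for f in config.required_fields):
            rows.append(renamed)
    return rows


# Two illustrative sources handled by the same code path.
crm = SourceConfig(
    name="crm_export",
    fmt="csv",
    required_fields=["customer_id"],
    rename={"cust": "customer_id"},
)
atm = SourceConfig(
    name="atm_events",
    fmt="jsonl",
    required_fields=["customer_id", "amount"],
)

crm_rows = ingest(crm, "cust,city\n42,Pune\n,Delhi\n")
atm_rows = ingest(atm, '{"customer_id": 42, "amount": 500}\n{"customer_id": 7}\n')
```

Onboarding a new source in this sketch means adding another `SourceConfig` entry rather than writing a new pipeline, which is the appeal of the approach. A real platform would typically keep such metadata in a catalog or configuration store instead of in code.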
Practitioners also insist on using analytical sandboxes, where users can experiment with new data instead of waiting for data engineers to skim and curate it. Once they have drawn insights from the new data, the users can hand it over to the engineers to integrate into the main pipeline.
“You should be able to integrate the data sources quickly into your platform if you have to be agile in your business. There should be no limits on the scale. Scale in this context is both storage and compute. There should be no limit on either of these,” said Ninad Phatak, Data Architect at Amazon, in his talk at the recently concluded analytics summit Cypher 2019.
Platforms As One Stop Solution
According to a report by Deloitte, industry experts believe that the following factors are to be considered for building next-generation platforms:
- Governance
- Use cases
- Infrastructure
- Privacy and security
- Tooling and delivery approach
The introduction of cloud technologies has enabled next-generation platforms to be implemented and configured, with the set-up then saved to an image so that it can be reused at a later date. This gives an organisation the control to automate the deployment of a data platform of varying size and power at will.
The objective behind establishing new data platforms is to accommodate innovative environments that are flexible enough to keep pace with a fast-moving technology landscape and a proliferation of emerging use cases. The next-generation data platform should be a fusion of disruptive, state-of-the-art technology with delivery methods that are effective and scalable.
A modern data platform will serve as a central repository for all the data, without any restrictions on data ingestion, while also being able to perform both batch and real-time inference.