Sumit Jindal, director of data engineering, and Rashmi Purbey, manager of data engineering, at Publicis Sapient spoke in detail about the evolution of modern data architecture and its applications at the Data Engineering Summit 2022. The duo unpacked the data buzzwords doing the rounds in the world of AI and data science in an information-packed session.
Sumit started off with the example of an online and offline multi-format retail and financial giant with a presence in multiple countries. The client needed multi-language support. “We had to build data for multiple business units spanning multiple business domains, and also account for country dimensions,” he said.
“We enabled a data platform that works seamlessly on diverse data sets from different business units and countries. The outcome of this system helps our clients become digitally integrated enterprises,” he added.
A modular view
“This is a logical view of a modern data platform. What we are seeing here is data coming in different formats: structured, unstructured, semi-structured, etc. The data can be integrated through APIs, on demand or as batch loads, through direct integration with databases, and we can have real-time streaming data feeding the system,” Sumit said.
The data collection layer should be able to consume and combine data from different sources. It should include a data provenance layer so you can trace back and see where something went wrong. Beyond that, storage is a fundamental component of a data platform.
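The idea of a collection layer with built-in provenance can be sketched in a few lines. The sketch below is illustrative only (the names `CollectionLayer`, `collect` and `trace` are invented, not from any specific platform): each incoming record is tagged with its source and format, so bad data can be traced back to where it came from.

```python
# Minimal sketch: a collection layer that tags every record with
# provenance metadata (source system, format, ingestion time).
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Record:
    payload: object
    source: str          # e.g. "orders_api", "pos_batch" (illustrative names)
    fmt: str             # "structured", "semi-structured", "unstructured"
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class CollectionLayer:
    def __init__(self):
        self.storage = []  # stand-in for the storage layer

    def collect(self, payload, source, fmt):
        rec = Record(payload, source, fmt)
        self.storage.append(rec)
        return rec

    def trace(self, source):
        """Provenance lookup: every record that came from one source."""
        return [r for r in self.storage if r.source == source]

layer = CollectionLayer()
layer.collect({"order_id": 1}, source="orders_api", fmt="structured")
layer.collect("<xml/>", source="legacy_feed", fmt="semi-structured")
print(len(layer.trace("orders_api")))  # 1
```

In a real platform the provenance metadata would live in a catalog rather than on each record, but the lookup pattern is the same.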
Evolution of data architecture
Sumit said the first-generation platforms were based on a data warehouse model. Data was integrated in batch fashion with ETL tools like Informatica, DataStage, etc, and the data warehouse would be a SQL-based system such as Oracle or Teradata, which were more performant for ad-hoc BI queries. The process of data integration faced limitations such as storage and compute power. Normally, a top-down approach is used while building such a data warehouse.
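The top-down, batch-ETL pattern Sumit describes can be sketched as a toy pipeline: the warehouse schema is decided up front, and each batch run extracts raw rows, transforms them to fit that schema, and loads them. All table and column names here are invented for illustration.

```python
# Toy batch ETL: extract raw source rows, transform them to a
# predefined (top-down) warehouse schema, load into a fact table.

raw_rows = [
    {"sku": "A1", "qty": "2", "price": "10.5", "country": "IN"},
    {"sku": "B7", "qty": "1", "price": "99.0", "country": "UK"},
]

def transform(row):
    # Conform to the schema decided up front: typed columns plus a
    # derived revenue measure, ready for BI queries.
    return {
        "sku": row["sku"],
        "quantity": int(row["qty"]),
        "revenue": int(row["qty"]) * float(row["price"]),
        "country_dim": row["country"],
    }

fact_sales = [transform(r) for r in raw_rows]  # the "load" step
print(fact_sales[0]["revenue"])  # 21.0
```

The limitation Sumit mentions follows directly: the schema is fixed before the data arrives, so anything that does not fit it (unstructured data, new formats) is hard to accommodate.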
The two-tier architecture is the modern method of data warehousing. With the advent of systems like Hadoop and Spark, a data lake based model has emerged. In a data warehouse, storing unstructured data is particularly challenging, and frequent data updates or ingestions pose another bottleneck.
“The two sections of a two-tier architecture are: first, a data lake layer, where you process your data, loading it from multiple sources; and second, data transformation, where you transform the data and make it available for ML and analytics use cases as well as a lot of ad hoc analytics,” he said.
The advantage is that multi-modal data is available in all formats. However, inconsistency or staleness of the data is an issue.
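The two tiers, and the staleness problem that comes with them, can be sketched in plain Python. This is a simplified illustration, not any vendor's API: raw multi-format data lands in a "lake", and a separate transform step produces a curated snapshot for analytics.

```python
# Two-tier sketch: raw data lands in a lake; a transform run
# produces a curated table. The curated copy only reflects the
# lake as of the last transform run -- hence staleness.

data_lake = []  # raw zone: accepts any format

def land(obj):
    data_lake.append(obj)

def transform_to_curated():
    # Keep only well-formed structured records for analytics.
    return [r for r in data_lake if isinstance(r, dict) and "amount" in r]

land({"amount": 10})
land("free text log line")        # unstructured, stays in the lake
curated = transform_to_curated()  # snapshot taken for analytics

land({"amount": 5})               # arrives after the transform run
# 'curated' is now stale until the next run -- the inconsistency
# mentioned above.
print(len(curated), len(transform_to_curated()))  # 1 2
```

The lakehouse architectures discussed next aim to close exactly this gap between the raw and curated copies.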
Rashmi Purbey spoke about the applications of data lakehouse on various cloud systems.
- Lakehouse on Databricks (Azure as cloud platform)
- Lakehouse on AWS
- BigLake – Lakehouse on GCP
- Lakehouse using Snowflake
Databricks combines the best of both worlds: data warehousing and the data lake.

Lakehouse on AWS provides a stable interface layer to query data from both the data warehouse and the data lake.

BigLake is a storage engine that allows organisations to unify data warehouses and lakes, apply uniform fine-grained access control, and accelerate query performance across multi-cloud storage and open formats.

Snowflake is a data warehouse built for the cloud. It enables the data-driven enterprise with instant elasticity, secure data sharing, and per-second pricing, combining the power of data warehousing, the flexibility of big data platforms and the elasticity of the cloud at a fraction of the cost of traditional solutions.
“When building a resilient, scalable data platform, businesses normally focus on the platform they are building, rather than on the analytics that goes behind building such a platform. Apart from that, one even has to consider the data being generated as a product in itself, as there is demand for such data in the market. One needs to keep enhancing and improving it to keep its quality up. Having the right quality checks and monitoring the output is of paramount importance in building a robust data product,” said Sumit.
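The "quality checks and monitoring" Sumit closes on can be sketched as a publish gate: each batch of a data product runs through a few checks, and only healthy batches are published. The check names and thresholds below are illustrative assumptions, not a specific tool's API.

```python
# Hedged sketch of "data as a product" quality gating: run checks
# on every batch and only publish batches that pass.

def null_check(rows, column):
    # No row may be missing this column.
    return all(r.get(column) is not None for r in rows)

def freshness_check(rows, max_age_days, today):
    # Every row must be at most max_age_days old.
    return all((today - r["day"]) <= max_age_days for r in rows)

def publish_if_healthy(rows, today):
    checks = {
        "no_null_skus": null_check(rows, "sku"),
        "fresh": freshness_check(rows, max_age_days=7, today=today),
    }
    healthy = all(checks.values())
    # In a real platform the per-check results would also feed a
    # monitoring dashboard, not just gate the publish.
    return healthy, checks

batch = [{"sku": "A1", "day": 100}, {"sku": "B2", "day": 99}]
ok, report = publish_if_healthy(batch, today=101)
print(ok)  # True
```

The point of returning the per-check report alongside the verdict is that monitoring the output, as Sumit stresses, needs to show *which* check degraded, not just that the batch failed.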