A data lake is a fundamental component in the digital transformation roadmap for most companies. But getting data ready and available for generating insights in a governed manner remains one of the most complex, costly and time-consuming processes. While data lakes have been around for many years, new tools and technologies continue to evolve, adding capabilities that make data lakes more cost-effective and improve their adoption.
Manish Gupta, head of the Azure Big Data engineering practice at Tiger Analytics, and Muthu Govindarajan, head of AWS & GCP Big Data, spoke in a presentation titled 'Next-Generation Data Lake' about industry trends such as self-service-based data ingestion, processing and delivery; analytics workbenches for business teams; automated data classification and PII discovery; and low/no-code approaches for data modernisation and migration programs.
“The whole landscape of what you can do with data lakes is changing,” said Manish. Over the last few years, new capabilities have emerged, said Muthu, who divided them into four segments:
Segment 1: What’s been there?
- Leveraging big data and cloud
- DataOps: CI/CD automation and platform automation
- Configurable reusable frameworks
Segment 2: What is getting better?
- Granular and efficient updates and deletes
- Granular data access controls
- Analytics workbench (a playground/lab setup for data scientists, with automated data provisioning on demand)
Segment 3: What’s creating impact?
- Configurable self-service-based data ingestion/processing and delivery
- Low/no-code approach for data modernisation & migration programs
- Virtual lakehouse technologies
- Data platform and observability
Segment 4: Cool trends
- Enterprise metadata knowledge graphs
- Containerised data engineering workloads
- Evolving hybrid cloud architecture patterns
Data lake management
Manish spoke in detail about improving the speed and agility of data lake management:
- Self-service data lake management
- Intelligent data catalogue and data discovery
- Improving SQL Query & BI performance
“One of the key factors that can help bring agility in the supply chain is about bringing self-service capabilities to the data. This means enabling all types of users to easily manage and govern data in a cloud lake environment themselves,” Manish said. It is important to create a robust and scalable pipeline for supporting a data lake and bring in the right automation and self-service capabilities. The self-service capabilities bring in higher agility and operational efficiency, allow users to leverage best practices in a reusable way with governance at the core, democratise data and reduce the burden on IT teams.
The first step is to create a cloud platform and infra services, on top of which reusable components are built, followed by data pipelines and orchestration, APIs for data and platform management, and lastly self-service interfaces and data lake users.
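The layering described above can be made concrete with a config-driven sketch. This is an illustration only, not Tiger Analytics' actual framework: the `IngestionConfig` fields, bucket paths and step names are all assumptions, showing how declarative configuration lets a self-service interface drive reusable pipeline components without hand-written code.

```python
# Hypothetical sketch of config-driven, self-service ingestion: a user-facing
# layer supplies a declarative config; reusable components translate it into
# pipeline steps. All names and paths here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class IngestionConfig:
    source: str                       # e.g. a raw-zone location
    target: str                       # e.g. a curated-zone location
    file_format: str                  # "csv", "parquet", ...
    schedule: str                     # cron expression for orchestration
    pii_columns: list = field(default_factory=list)  # columns to mask


def build_pipeline(cfg: IngestionConfig) -> list:
    """Translate a declarative config into an ordered list of pipeline steps."""
    steps = [f"read:{cfg.file_format}:{cfg.source}"]
    if cfg.pii_columns:
        # Governance baked into the reusable component, not left to each user
        steps.append("mask:" + ",".join(cfg.pii_columns))
    steps.append(f"write:{cfg.target}")
    return steps


cfg = IngestionConfig(
    source="s3://raw-zone/sales/",
    target="s3://curated-zone/sales/",
    file_format="parquet",
    schedule="0 2 * * *",
    pii_columns=["email", "phone"],
)
print(build_pipeline(cfg))
# → ['read:parquet:s3://raw-zone/sales/', 'mask:email,phone',
#    'write:s3://curated-zone/sales/']
```

Because governance steps such as PII masking live in the shared component rather than in user code, every self-service pipeline inherits them automatically.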
The goal of the business is to deliver data trusted by the user. “This is a step beyond data quality,” said Muthu. Observability works in three stages: basic data health monitoring, advanced monitoring with prediction, and data platform observability. Basic data health monitoring is a critical component of every data platform, consisting of key capabilities such as a data catalogue, self-service data discovery, configurable tools, data health dashboards, case management, crowdsourced data asset rating, alerts, and actions.
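The first stage, basic data health monitoring, can be illustrated with a minimal check that raises alerts when data quality drifts. This is a sketch under assumed rules: the null-ratio threshold and alert wording are inventions for illustration, not part of the presented platform.

```python
# Illustrative basic data health check: verify that a batch arrived and that
# required columns stay under an assumed null-ratio threshold, emitting
# alerts of the kind a data health dashboard would surface.
def health_check(rows, required_cols, max_null_ratio=0.05):
    """Return a list of alert strings for a batch of row dicts."""
    if not rows:
        return ["ALERT: no rows received"]
    alerts = []
    for col in required_cols:
        nulls = sum(1 for r in rows if r.get(col) is None)
        ratio = nulls / len(rows)
        if ratio > max_null_ratio:
            alerts.append(f"ALERT: {col} null ratio {ratio:.0%} exceeds threshold")
    return alerts


batch = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": None}]
print(health_check(batch, ["id", "email"]))
# → ['ALERT: email null ratio 50% exceeds threshold']
```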
He discussed how users can ensure data quality and platform governance through two key ideas:
- Data observability
- Platform observability
On the observability front, several value generators are important for making reliable data available to consumers on time. These include monitoring data flow and environment, monitoring performance, monitoring data security, analysing workloads, predicting issues, and optimising reliable data delivery.
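The "predicting issues" idea can be sketched with a simple baseline check on pipeline run durations: flag a run that drifts far from its recent average before downstream consumers notice late data. The rolling window and 2x threshold are assumptions chosen for illustration.

```python
# Sketch of predictive monitoring on pipeline run durations (in minutes):
# flag any run slower than `factor` times the rolling mean of the previous
# `window` runs. Window size and factor are illustrative assumptions.
from statistics import mean


def flag_slow_runs(durations_min, factor=2.0, window=5):
    """Return indices of runs exceeding factor x the rolling baseline."""
    flagged = []
    for i in range(window, len(durations_min)):
        baseline = mean(durations_min[i - window:i])
        if durations_min[i] > factor * baseline:
            flagged.append(i)
    return flagged


print(flag_slow_runs([10, 11, 9, 10, 12, 31, 10]))
# → [5]  (the 31-minute run, against a ~10-minute baseline)
```

In a real platform the same pattern applies to data freshness, row counts and error rates, feeding the alerts-and-actions loop described above.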