How to build the next-generation data lake

The whole landscape of what you can do with data lakes is changing.

A data lake is a fundamental component in the digital transformation roadmap for most companies. But getting data ready and available for generating insights in a governed manner is still one of the most complex, costly and time-consuming processes. While data lakes have been around for many years, tools and technologies keep evolving, and a new set of capabilities is being added to data lakes to make them more cost-effective and improve their adoption.

Manish Gupta, head of the Azure Big Data engineering practice at Tiger Analytics, and Muthu Govindarajan, head of AWS & GCP Big Data, spoke about industry trends in a presentation titled “Next-generation data lake”. These trends include self-service data ingestion, processing and delivery; an analytics workbench for business teams; automated data classification and PII discovery; and a low/no-code approach to data modernisation and migration programs.

“The whole landscape of what you can do with data lakes is changing,” said Manish. Over the last few years, new capabilities have come up, said Muthu. He divided them into four segments:


Segment 1: What’s been there?

  • Leveraging big data and cloud
  • DataOps: CI/CD automation and platform automation
  • Configurable reusable frameworks

Segment 2: What is getting better?

  • Granular and efficient updates and deletes
  • Granular data access controls
  • Analytics workbench (a playground or lab setup for data scientists, with automated on-demand data provisioning)

Segment 3: What’s creating impact?

  • Configurable, self-service data ingestion, processing and delivery
  • Low/no-code approach for data modernisation & migration programs
  • Virtual lakehouse technologies
  • Data platform and observability 

Segment 4: Cool trends

  • Enterprise metadata knowledge graphs
  • Containerised data engineering workloads
  • Evolving hybrid cloud architecture patterns

Data lake management

Manish spoke in detail about improving the speed and agility of data lake management:

  1. Self-service data lake management
  2. Intelligent data catalogue and data discovery
  3. Improving SQL query & BI performance

“One of the key factors that can help bring agility in the supply chain is about bringing self-service capabilities to the data. This means enabling all types of users to easily manage and govern data in a cloud lake environment themselves,” Manish said. It is important to build a robust, scalable pipeline to support a data lake and to bring in the right automation and self-service capabilities. These capabilities increase agility and operational efficiency, let users apply best practices in a reusable way with governance at the core, democratise data and reduce the burden on IT teams.

The platform is built up in layers. The first step is to create the cloud platform and infrastructure services, on top of which reusable components are built. These are followed by data pipelines and orchestration, then APIs for data and platform management, and lastly the self-service interfaces through which data lake users work.
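To make the layering concrete, here is a minimal Python sketch of a configuration-driven pipeline. It is an illustration only, not the framework described in the presentation: the component names (`drop_missing`, `rename_column`), the config shape and the `run_pipeline` function are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[List[Record], dict], List[Record]]


# Reusable components layer: small named transformations (hypothetical names).
def drop_missing(records: List[Record], opts: dict) -> List[Record]:
    """Keep only records where the configured column is present."""
    column = opts["column"]
    return [r for r in records if r.get(column) is not None]


def rename_column(records: List[Record], opts: dict) -> List[Record]:
    """Rename one column in every record."""
    old, new = opts["from"], opts["to"]
    return [{(new if k == old else k): v for k, v in r.items()} for r in records]


COMPONENTS: Dict[str, Step] = {
    "drop_missing": drop_missing,
    "rename_column": rename_column,
}


# Pipeline and orchestration layer: apply the configured steps in order.
def run_pipeline(config: dict, records: List[Record]) -> List[Record]:
    for step in config["steps"]:
        name = step["use"]
        if name not in COMPONENTS:
            raise ValueError(f"unknown component: {name}")
        records = COMPONENTS[name](records, step.get("with", {}))
    return records


# Self-service layer: what a user might submit (assumed config shape).
orders_config = {
    "dataset": "orders",
    "steps": [
        {"use": "drop_missing", "with": {"column": "order_id"}},
        {"use": "rename_column", "with": {"from": "amt", "to": "amount"}},
    ],
}

raw = [{"order_id": 1, "amt": 10.0}, {"order_id": None, "amt": 5.0}]
clean = run_pipeline(orders_config, raw)
print(clean)
```

The point of the design is that users only write the configuration, while the components and the orchestration logic are maintained once and reused across datasets.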

The goal of the business is to deliver data that users trust. “This is a step beyond data quality,” said Muthu. Observability works in three stages: basic data health monitoring, advanced monitoring with prediction, and data platform observability. Basic data health monitoring is a critical component of every data platform. Its key capabilities include a data catalogue, self-service data discovery, configurable tools, data health dashboards, case management, crowdsourced data asset ratings, alerts and actions.
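As a rough illustration of what basic data health monitoring can involve, the Python sketch below checks a dataset for emptiness, missing values and staleness, and returns alert messages. The function names, thresholds and sample rows are assumptions made for this example; the presentation did not specify particular checks.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

Row = Dict[str, object]


def null_rate(rows: List[Row], column: str) -> float:
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) is None)
    return missing / len(rows)


def health_alerts(
    rows: List[Row],
    required_columns: List[str],
    last_loaded_at: datetime,
    now: datetime,
    max_null_rate: float = 0.05,              # assumed threshold
    max_age: timedelta = timedelta(hours=24),  # assumed freshness window
) -> List[str]:
    """Return human-readable alerts; an empty list means the checks passed."""
    alerts = []
    if not rows:
        alerts.append("dataset is empty")
    for column in required_columns:
        rate = null_rate(rows, column)
        if rate > max_null_rate:
            alerts.append(f"{column}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    if now - last_loaded_at > max_age:
        alerts.append("data is stale")
    return alerts


now = datetime(2023, 1, 10, 12, 0, tzinfo=timezone.utc)
rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
alerts = health_alerts(rows, ["id", "email"], now - timedelta(hours=30), now)
for alert in alerts:
    print(alert)
```

In a real platform, alerts like these would typically feed the dashboards and case management capabilities listed above rather than being printed.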

He discussed how users could ensure quality and platform governance through two key ideas. 

  1. Data observability 
  2. Platform observability 

On the observability side, several capabilities help make reliable data available to consumers on time. These include monitoring data flow and the environment, monitoring performance, monitoring data security, analysing workloads, predicting issues and optimising for reliable data delivery.
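One simple way to approach performance monitoring and issue prediction is to compare each pipeline run against recent history. The Python sketch below flags a run whose duration deviates strongly from past runs using a z-score. This technique is an assumption chosen for illustration, not a method named in the presentation, and the function name, threshold and sample durations are hypothetical.

```python
from statistics import mean, stdev
from typing import List


def is_anomalous(history: List[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of previous run durations."""
    if len(history) < 2:
        return False  # not enough history to estimate variability
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Durations of recent runs in minutes (made-up sample values).
recent_runs = [62.0, 58.0, 60.0, 61.0, 59.0]
slow_run_flagged = is_anomalous(recent_runs, 95.0)
normal_run_flagged = is_anomalous(recent_runs, 61.0)
print(slow_run_flagged, normal_run_flagged)
```

A fixed z-score threshold is easy to reason about but ignores trends and seasonality, so production systems often replace it with more robust models.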


Avi Gopani
Avi Gopani is a technology journalist that seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.
