How to build the next-generation data lake

The whole landscape of what you can do with data lakes is changing.

A data lake is a fundamental component in the digital transformation roadmap for most companies. But the ability to get data ready and available for generating insights in a governed manner remains one of the most complex, costly and time-consuming processes. While data lakes have been around for many years, new tools and technologies are evolving, and new capabilities are being added to data lakes to make them more cost-effective and improve their adoption.

Manish Gupta, head of the Azure Big Data engineering practice at Tiger Analytics, and Muthu Govindarajan, head of AWS & GCP Big Data, spoke about industry trends in a presentation titled "Next-generation data lake": self-service data ingestion, processing and delivery; analytics workbenches for business teams; automated data classification and PII discovery; and low/no-code approaches for data modernisation and migration programmes.

“The whole landscape of what you can do with data lakes is changing,” said Manish. Over the last few years, new capabilities have emerged, said Muthu, who grouped them into four segments:

Segment 1: What’s been there?

  • Leveraging big data and cloud
  • DataOps- CI/CD automation and platform automation
  • Configurable reusable frameworks

Segment 2: What is getting better?

  • Granular and efficient updates and deletes
  • Granular data access controls
  • Analytics workbench (playground, lab setup for data scientist with automated data provisioning on-demand)

Segment 3: What’s creating impact?

  • Configurable self-service-based data ingestion/processing and delivery
  • Low/ no-code approach for data modernisation & migration programs
  • Virtual lakehouse technologies
  • Data platform and observability 

Segment 4: Cool trends

  • Enterprise metadata knowledge graphs
  • Containerised data engineering workloads
  • Evolving hybrid cloud architecture patterns

Data lake management

Manish spoke in detail about improving the speed and agility of data lake management:

  1. Self-service data lake management
  2. Intelligent data catalogue and data discovery
  3. Improving SQL Query & BI performance 

“One of the key factors that can help bring agility in the supply chain is about bringing self-service capabilities to the data. This means enabling all types of users to easily manage and govern data in a cloud lake environment themselves,” Manish said. It is important to create a robust and scalable pipeline for supporting a data lake and bring in the right automation and self-service capabilities. The self-service capabilities bring in higher agility and operational efficiency, allow users to leverage best practices in a reusable way with governance at the core, democratise data and reduce the burden on IT teams. 

The first step is to create cloud platform and infrastructure services, on top of which reusable components are built, followed by data pipelines and orchestration, then APIs for data and platform management, and lastly self-service interfaces for data lake users.
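The self-service idea described above could be sketched as a minimal config-driven ingestion framework: a business user declares a source through a simple config, and a reusable pipeline applies governance checks before anything lands in the lake. All names here (`SourceConfig`, `run_pipeline`, the supported formats) are illustrative assumptions, not part of any specific Tiger Analytics platform:

```python
# Hedged sketch of a configurable, self-service ingestion pipeline.
# Class and function names are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class SourceConfig:
    """Declarative source description a user could supply via a self-service UI."""
    name: str
    fmt: str                                    # e.g. "csv", "json", "parquet"
    schedule: str = "daily"
    pii_columns: list = field(default_factory=list)  # columns to mask on ingest

def validate(cfg: SourceConfig) -> list:
    """Governance checks run before any data lands in the lake."""
    errors = []
    if cfg.fmt not in {"csv", "json", "parquet"}:
        errors.append(f"unsupported format: {cfg.fmt}")
    if not cfg.name:
        errors.append("source name is required")
    return errors

def run_pipeline(cfg: SourceConfig) -> dict:
    """Reusable pipeline: validate, then (stubbed) ingest and catalogue the source."""
    errors = validate(cfg)
    if errors:
        return {"status": "rejected", "errors": errors}
    # A real implementation would ingest data and update the catalogue here.
    return {"status": "ingested", "source": cfg.name, "masked": cfg.pii_columns}

result = run_pipeline(SourceConfig(name="orders", fmt="csv", pii_columns=["email"]))
print(result)
```

Because governance (validation, PII masking) sits inside the reusable pipeline rather than with the requesting user, every self-service request inherits it automatically.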

The goal of the business is to deliver data trusted by the user. “This is a step beyond data quality,” said Muthu. Observability works in three stages: basic data health monitoring, then advanced monitoring with prediction, and finally data platform observability. Basic data health monitoring is a critical component in every data platform, consisting of key capabilities like a data catalogue, self-service data discovery, configurable tools, data health dashboards, case management, crowdsourced data asset ratings, alerts, and actions.
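A minimal sketch of what "basic data health monitoring" could look like in practice: compute a few health metrics for a table (row count, null rate, freshness) and raise alerts when thresholds are breached, feeding a dashboard. The thresholds and metric names below are illustrative assumptions:

```python
# Hedged sketch of basic data health monitoring: metrics plus threshold alerts.
# Threshold values and metric names are illustrative assumptions.

from datetime import datetime, timedelta, timezone

def health_check(rows, null_rate_max=0.05, min_rows=1, max_age=timedelta(days=1)):
    """rows: list of dicts with an 'updated_at' timestamp.
    Returns (metrics, alerts) suitable for a data health dashboard."""
    if len(rows) < min_rows:
        return {"row_count": len(rows)}, ["table is empty"]
    # Null rate across all cells in the table
    cells = [v for r in rows for v in r.values()]
    null_rate = sum(v is None for v in cells) / len(cells)
    # Freshness: age of the newest record
    newest = max(r["updated_at"] for r in rows)
    age = datetime.now(timezone.utc) - newest
    metrics = {
        "row_count": len(rows),
        "null_rate": round(null_rate, 3),
        "age_hours": round(age.total_seconds() / 3600, 1),
    }
    alerts = []
    if null_rate > null_rate_max:
        alerts.append(f"null rate {null_rate:.1%} exceeds {null_rate_max:.0%}")
    if age > max_age:
        alerts.append("data is stale")
    return metrics, alerts

now = datetime.now(timezone.utc)
rows = [{"id": 1, "email": None, "updated_at": now},
        {"id": 2, "email": "a@b.c", "updated_at": now - timedelta(hours=2)}]
metrics, alerts = health_check(rows)
print(metrics, alerts)
```

The later stages Muthu mentions, advanced monitoring with prediction and platform observability, would build on the same metrics by forecasting breaches rather than just reporting them.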

He discussed how users could ensure quality and platform governance through two key ideas. 

  1. Data observability 
  2. Platform observability 

On the observability end, some value generators are important to make reliable data available on time to consumers. These include monitoring data flow and environment, monitoring performance, monitoring data security, analysing workloads, predicting issues and optimising reliable data delivery.

