How to build the next-generation data lake

The whole landscape of what you can do with data lakes is changing.


A data lake is a fundamental component in the digital transformation roadmap of most companies. However, getting data ready and available for generating insights in a governed manner remains one of the most complex, costly and time-consuming processes. While data lakes have been around for many years, new tools and technologies are evolving, and new capabilities are being added to data lakes to make them more cost-effective and to improve their adoption.

Manish Gupta, head of the Azure Big Data engineering practice at Tiger Analytics, and Muthu Govindarajan, head of AWS & GCP Big Data, spoke in their presentation, titled “Next-generation Data Lake”, about industry trends such as self-service-based data ingestion, processing and delivery; analytics workbenches for business teams; automated data classification and PII discovery; and low/no-code approaches to data modernisation and migration programs.

“The whole landscape of what you can do with data lakes is changing,” said Manish. Over the last few years, new capabilities have come up, said Muthu, who divided them into four segments:

Segment 1: What’s been there?

  • Leveraging big data and cloud
  • DataOps: CI/CD automation and platform automation
  • Configurable reusable frameworks

Segment 2: What is getting better?

  • Granular and efficient updates and deletes (see the sketch after this list)
  • Granular data access controls
  • Analytics workbench (a playground or lab setup for data scientists, with automated, on-demand data provisioning)
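
The session did not walk through code, but a short sketch helps show what granular, row-level updates and deletes look like in practice. The example below assumes Delta Lake on Apache Spark; the table path, column names and values are hypothetical.

```python
# Minimal sketch of row-level deletes and updates on a lake table,
# assuming Delta Lake on Apache Spark (pip install pyspark delta-spark).
# Table path, column names and values are hypothetical.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

builder = (
    SparkSession.builder.appName("granular-updates")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

customers = DeltaTable.forPath(spark, "s3://my-lake/silver/customers")

# Delete one customer's rows (e.g., a right-to-erasure request) without
# rewriting the whole dataset.
customers.delete("customer_id = '12345'")

# Correct a single attribute in place; only the affected files are rewritten.
customers.update(
    condition="customer_id = '67890'",
    set={"email": "'new.address@example.com'"},
)
```

Apache Iceberg and Apache Hudi offer equivalent row-level operations; the common thread is that small corrections no longer force full-partition rewrites.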

Segment 3: What’s creating impact?

  • Configurable, self-service-based data ingestion/processing and delivery (see the sketch after this list)
  • Low/no-code approach for data modernisation & migration programs
  • Virtual lakehouse technologies
  • Data and platform observability
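
To make the self-service idea concrete, here is a hedged sketch of config-driven ingestion: a user submits a small spec and a generic engine executes it. The YAML fields, paths and function are invented for illustration, not taken from the talk.

```python
# Hypothetical sketch of config-driven ingestion: a business user supplies
# a small YAML spec, and one generic engine runs the same ingest logic for
# any source (pip install pandas pyarrow pyyaml). Fields are illustrative.
import yaml
import pandas as pd

SPEC = """
source:
  type: csv
  path: ./landing/orders.csv
target:
  path: ./lake/bronze/orders.parquet
transforms:
  drop_duplicates: true
  rename:
    ord_id: order_id
"""

def ingest(spec: dict) -> None:
    """Run one ingestion described entirely by configuration."""
    df = pd.read_csv(spec["source"]["path"])
    t = spec.get("transforms", {})
    if t.get("drop_duplicates"):
        df = df.drop_duplicates()
    if "rename" in t:
        df = df.rename(columns=t["rename"])
    df.to_parquet(spec["target"]["path"], index=False)

ingest(yaml.safe_load(SPEC))
```

A production engine would add schema validation, data-quality rules and lineage capture, but the self-service contract stays the same: users edit configuration, not pipeline code.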

Segment 4: Cool trends

  • Enterprise metadata knowledge graphs
  • Containerised data engineering workloads
  • Evolving hybrid cloud architecture patterns

Data lake management

Manish spoke in detail about improving the speed and agility of data lake management:

  1. Self-service data lake management
  2. Intelligent data catalogue and data discovery
  3. Improving SQL query & BI performance

“One of the key factors that can help bring agility to the data supply chain is bringing self-service capabilities to the data. This means enabling all types of users to easily manage and govern data in a cloud lake environment themselves,” Manish said. It is important to create robust, scalable pipelines to support a data lake and to bring in the right automation and self-service capabilities. Self-service capabilities bring higher agility and operational efficiency, let users apply best practices in a reusable way with governance at the core, democratise data and reduce the burden on IT teams.

The first step is to create the cloud platform and infrastructure services; on top of these sit reusable components, then data pipelines and orchestration, then APIs for data and platform management, and finally the self-service interfaces used by data lake users.
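
As a hedged illustration of the “APIs for data and platform management” layer, the sketch below shows a hypothetical dataset-provisioning endpoint; the route, payload fields and behaviour are invented for this example, not taken from the talk.

```python
# Hypothetical sketch of a data-platform management API sitting between
# the reusable pipeline components and the self-service UI
# (pip install fastapi uvicorn). Routes and fields are invented.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="data-platform-management")

class DatasetRequest(BaseModel):
    name: str            # e.g. "sales_orders"
    owner: str           # requesting team, kept for governance/audit
    layer: str = "bronze"

@app.post("/datasets")
def provision_dataset(req: DatasetRequest) -> dict:
    # A real implementation would create storage paths, register the
    # dataset in the catalogue and attach access policies. Here we only
    # echo what would be provisioned.
    location = f"s3://my-lake/{req.layer}/{req.name}"
    return {"dataset": req.name, "owner": req.owner, "location": location}

# Run with: uvicorn this_module:app --reload
```

The self-service interfaces in the last layer then become thin clients over such an API, which keeps governance centralised even as usage is democratised.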

The goal of the business is to deliver data the user can trust. “This is a step beyond data quality,” said Muthu. Observability works in three stages: basic data health monitoring, advanced monitoring with prediction, and data platform observability. Basic data health monitoring is a critical component of every data platform, consisting of key capabilities like a data catalogue, self-service data discovery, configurable tools, data health dashboards, case management, crowdsourced data asset ratings, alerts and actions.
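
A minimal sketch of what the basic stage can look like, assuming freshness and volume checks wired to an alert hook; the thresholds, table name and alert sink below are illustrative, not from the talk.

```python
# Hypothetical sketch of basic data health monitoring: freshness and
# row-count checks feeding an alerting hook. Thresholds, table names and
# the alert sink are illustrative.
from datetime import datetime, timedelta, timezone

def alert(message: str) -> None:
    # Stand-in for a real sink (Slack, PagerDuty, a case-management tool).
    print(f"[DATA HEALTH ALERT] {message}")

def check_health(table: str, last_loaded: datetime, row_count: int,
                 max_staleness: timedelta, min_rows: int) -> bool:
    healthy = True
    if datetime.now(timezone.utc) - last_loaded > max_staleness:
        alert(f"{table}: data is stale (last load {last_loaded:%Y-%m-%d %H:%M})")
        healthy = False
    if row_count < min_rows:
        alert(f"{table}: row count {row_count} below floor {min_rows}")
        healthy = False
    return healthy

# Example run with made-up metadata, as a scheduler would supply it.
check_health(
    table="bronze.orders",
    last_loaded=datetime.now(timezone.utc) - timedelta(hours=30),
    row_count=120,
    max_staleness=timedelta(hours=24),
    min_rows=1_000,
)
```

Fixed thresholds like these are what the later “advanced monitoring with prediction” stage replaces with learned baselines.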

He discussed how users could ensure quality and platform governance through two key ideas. 

  1. Data observability 
  2. Platform observability 

On the observability side, several value generators help make reliable data available to consumers on time: monitoring the data flow and environment, monitoring performance, monitoring data security, analysing workloads, predicting issues and optimising reliable data delivery.
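
As a hedged sketch of the “predicting issues” piece, the simplest possible approach flags a load as anomalous when it falls outside the recent mean plus or minus three standard deviations; the history values below are made up for illustration.

```python
# Hypothetical sketch of predictive monitoring: flag a new load as
# anomalous if it deviates more than k standard deviations from recent
# history. The daily row counts are made up.
from statistics import mean, stdev

def is_anomalous(history: list[int], latest: int, k: float = 3.0) -> bool:
    mu, sigma = mean(history), stdev(history)
    return abs(latest - mu) > k * sigma

daily_row_counts = [98_500, 101_200, 99_800, 100_400, 102_100, 99_100]
today = 62_000  # sudden drop, likely an upstream failure

if is_anomalous(daily_row_counts, today):
    print("Predicted delivery issue: today's volume deviates from trend.")
```

Production systems would use seasonality-aware models instead, but the operational pattern is the same: learn what normal looks like, and alert on deviation before consumers notice.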
