A data lake is a fundamental component in the digital transformation roadmap for most companies. But getting data ready and available for generating insights in a governed manner remains one of the most complex, costly and time-consuming processes. While data lakes have been around for many years, new tools and technologies continue to evolve, adding capabilities that make data lakes more cost-effective and improve their adoption.
Manish Gupta, head of the Azure Big Data engineering practice at Tiger Analytics, and Muthu Govindarajan, head of AWS & GCP Big Data, spoke in a presentation titled 'Next-Generation Data Lake' about industry trends such as self-service-based data ingestion, processing and delivery; analytics workbenches for business teams; automated data classification and PII discovery; and low/no-code approaches for data modernisation and migration programs.
“The whole landscape of what you can do with data lakes is changing,” said Manish. Over the last few years, new capabilities have emerged, said Muthu, who divided them into four segments:
Segment 1: What’s been there?
- Leveraging big data and cloud
- DataOps: CI/CD automation and platform automation
- Configurable reusable frameworks
Segment 2: What is getting better?
- Granular and efficient updates and deletes
- Granular data access controls
- Analytics workbench (a playground/lab setup for data scientists, with automated data provisioning on demand)
Segment 3: What’s creating impact?
- Configurable self-service-based data ingestion/processing and delivery
- Low/no-code approach for data modernisation & migration programs
- Virtual lakehouse technologies
- Data platform and observability
Segment 4: Cool trends
- Enterprise metadata knowledge graphs
- Containerised data engineering workloads
- Evolving hybrid cloud architecture patterns
Data lake management
Manish spoke in detail about improving the speed and agility of data lake management:
- Self-service data lake management
- Intelligent data catalogue and data discovery
- Improving SQL Query & BI performance
“One of the key factors that can help bring agility in the supply chain is about bringing self-service capabilities to the data. This means enabling all types of users to easily manage and govern data in a cloud lake environment themselves,” Manish said. It is important to create a robust and scalable pipeline for supporting a data lake and bring in the right automation and self-service capabilities. The self-service capabilities bring in higher agility and operational efficiency, allow users to leverage best practices in a reusable way with governance at the core, democratise data and reduce the burden on IT teams.
The first step is to create a cloud platform and infra services, on top of which reusable components are built, followed by data pipelines and orchestration, APIs for data and platform management, and lastly self-service interfaces and data lake users.
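The layering described above can be made concrete with a config-driven sketch. This is an illustration only, not Tiger Analytics' actual framework: the `IngestionConfig` fields, bucket paths and step names are all assumptions, showing how declarative configuration lets a self-service interface drive reusable pipeline components without hand-written code.

```python
# Hypothetical sketch of config-driven, self-service ingestion: a user-facing
# layer supplies a declarative config; reusable components translate it into
# pipeline steps. All names and paths here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class IngestionConfig:
    source: str                       # e.g. a raw-zone location
    target: str                       # e.g. a curated-zone location
    file_format: str                  # "csv", "parquet", ...
    schedule: str                     # cron expression for orchestration
    pii_columns: list = field(default_factory=list)  # columns to mask


def build_pipeline(cfg: IngestionConfig) -> list:
    """Translate a declarative config into an ordered list of pipeline steps."""
    steps = [f"read:{cfg.file_format}:{cfg.source}"]
    if cfg.pii_columns:
        # Governance baked into the reusable component, not left to each user
        steps.append("mask:" + ",".join(cfg.pii_columns))
    steps.append(f"write:{cfg.target}")
    return steps


cfg = IngestionConfig(
    source="s3://raw-zone/sales/",
    target="s3://curated-zone/sales/",
    file_format="parquet",
    schedule="0 2 * * *",
    pii_columns=["email", "phone"],
)
print(build_pipeline(cfg))
# → ['read:parquet:s3://raw-zone/sales/', 'mask:email,phone',
#    'write:s3://curated-zone/sales/']
```

Because governance steps such as PII masking live in the shared component rather than in user code, every self-service pipeline inherits them automatically.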
The goal of the business is to deliver data trusted by the user. “This is a step beyond data quality,” said Muthu. Observability works in three stages: basic data health monitoring, advanced monitoring with prediction, and data platform observability. Basic data health monitoring is a critical component of every data platform, consisting of key capabilities such as a data catalogue, self-service data discovery, configurable tools, data health dashboards, case management, crowdsourced data asset rating, alerts, and actions.
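The first stage, basic data health monitoring, can be illustrated with a minimal check that raises alerts when data quality drifts. This is a sketch under assumed rules: the null-ratio threshold and alert wording are inventions for illustration, not part of the presented platform.

```python
# Illustrative basic data health check: verify that a batch arrived and that
# required columns stay under an assumed null-ratio threshold, emitting
# alerts of the kind a data health dashboard would surface.
def health_check(rows, required_cols, max_null_ratio=0.05):
    """Return a list of alert strings for a batch of row dicts."""
    if not rows:
        return ["ALERT: no rows received"]
    alerts = []
    for col in required_cols:
        nulls = sum(1 for r in rows if r.get(col) is None)
        ratio = nulls / len(rows)
        if ratio > max_null_ratio:
            alerts.append(f"ALERT: {col} null ratio {ratio:.0%} exceeds threshold")
    return alerts


batch = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": None}]
print(health_check(batch, ["id", "email"]))
# → ['ALERT: email null ratio 50% exceeds threshold']
```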
He discussed how users can ensure data quality and platform governance through two key ideas:
- Data observability
- Platform observability
On the observability front, several value generators are important for making reliable data available to consumers on time. These include monitoring data flow and environment, monitoring performance, monitoring data security, analysing workloads, predicting issues, and optimising reliable data delivery.
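The "predicting issues" idea can be sketched with a simple baseline check on pipeline run durations: flag a run that drifts far from its recent average before downstream consumers notice late data. The rolling window and 2x threshold are assumptions chosen for illustration.

```python
# Sketch of predictive monitoring on pipeline run durations (in minutes):
# flag any run slower than `factor` times the rolling mean of the previous
# `window` runs. Window size and factor are illustrative assumptions.
from statistics import mean


def flag_slow_runs(durations_min, factor=2.0, window=5):
    """Return indices of runs exceeding factor x the rolling baseline."""
    flagged = []
    for i in range(window, len(durations_min)):
        baseline = mean(durations_min[i - window:i])
        if durations_min[i] > factor * baseline:
            flagged.append(i)
    return flagged


print(flag_slow_runs([10, 11, 9, 10, 12, 31, 10]))
# → [5]  (the 31-minute run, against a ~10-minute baseline)
```

In a real platform the same pattern applies to data freshness, row counts and error rates, feeding the alerts-and-actions loop described above.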