Data Lake: What It Takes To Do It Right?

We are living in an age of digital transformation in which vast volumes of data are being created in diverse forms and shapes. Every business is trying to derive more value from the data available to it.

One of the focus areas in data modernization is the addition of data lakes to the overall data architecture.

So, what is a Data Lake?

A data lake is a collection of data, not a platform for data. Data lakes are usually managed on Hadoop and less often on an RDBMS. A common myth states that data lakes require open source Apache Hadoop or a vendor distribution of Hadoop. It is true that the majority of data lake implementations are on Hadoop; these are called Hadoop-based data lakes. However, a few data lakes are deployed atop an RDBMS, and these are called relational data lakes.

Salient features of data lakes

  • Handles large volumes of diverse (structured, semi-structured and unstructured) data
  • Mostly detailed source data
  • Raw material for discovering entities and facts
  • Data prep on demand
  • Data repurposed later, as needs arise
  • Typically schema on read
  • Persists data in its original raw state
  • Integrates into multiple enterprise data ecosystems and architectures
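
To make "schema on read" and "persists data in its original raw state" concrete, here is a minimal sketch in Python. The sample records, the field names and the `read_with_schema` helper are all assumptions made for illustration; they do not come from any particular data lake product. The point is only that the raw lines stay untouched and each reader applies the structure it needs when it reads them.

```python
import json

# Raw records kept exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"id": "1", "name": "Asha", "signup": "2023-01-05", "plan": "gold"}',
    '{"id": "2", "name": "Ravi"}',
]

# A schema chosen by the reader at query time: field name -> converter.
CUSTOMER_SCHEMA = {"id": int, "name": str, "plan": str}


def read_with_schema(lines, schema):
    """Parse raw lines and project each one onto the schema.

    Fields missing from a raw record become None; fields that are not
    in the schema are ignored rather than rejected.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            row[field] = convert(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, CUSTOMER_SCHEMA)
for row in rows:
    print(row)
```

A different consumer could read the same raw lines with a different schema (for example, one that includes `signup`) without any change to the stored data.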

Although data lakes offer many potential benefits, my focus here is on the key barriers to adopting them.

Barriers to Data Lakes and how to overcome them

a) Data lake design:

In most scenarios, data warehouse architects design the data lake and get carried away with the traditional approaches and principles of data warehouse design. A data lake is not a data warehouse, and a zone is not heavily structured like a subject area or a dimension.

Expect a few zones; within each zone, data is still in raw or only slightly standardized form. Typical zones are data landing, data staging, data domains (for example, HR or customer data), departmental domains (for example, data used by marketers), analytics archives and analytics sandboxes. Once the zones are decided, design the data flows that move data from one zone to another. Expect to revisit how data is organized in your data lake. When data needs to be restructured, it should leave the lake and go to a more structured environment, such as a data warehouse or data mart. One of the functions of the lake is to feed other databases.
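
The idea of zones and explicit data flows between them can be sketched with plain directories. This is only an illustration under stated assumptions: the zone names, the `ALLOWED_FLOWS` set and the `create_zones` and `promote` helpers are hypothetical, and a real lake would more likely sit on HDFS or an object store than on a local temporary folder.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical zone names, loosely following the zones listed above.
ZONES = ["landing", "staging", "domains", "sandbox"]

# Data may only move along flows that were designed up front.
ALLOWED_FLOWS = {
    ("landing", "staging"),
    ("staging", "domains"),
    ("domains", "sandbox"),
}


def create_zones(lake_root):
    """Create one directory per zone under the lake root."""
    for zone in ZONES:
        (Path(lake_root) / zone).mkdir(parents=True, exist_ok=True)


def promote(lake_root, filename, src, dst):
    """Copy a file from one zone to another if the flow is allowed.

    The source copy is left in place, so the raw data is preserved.
    """
    if (src, dst) not in ALLOWED_FLOWS:
        raise ValueError(f"flow {src} -> {dst} is not allowed")
    source = Path(lake_root) / src / filename
    target = Path(lake_root) / dst / filename
    shutil.copy2(source, target)
    return target


lake_root = tempfile.mkdtemp()
create_zones(lake_root)
(Path(lake_root) / "landing" / "hr.csv").write_text("emp_id,name\n1,Asha\n")
staged = promote(lake_root, "hr.csv", "landing", "staging")
print(staged)
```

Keeping the allowed flows in one place makes it easy to revisit the organization later: adding or removing a zone means changing `ZONES` and `ALLOWED_FLOWS`, not every job that moves data.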

b) Governance:

When a data lake is not managed and governed properly, it deteriorates into a data swamp. It becomes nearly impossible to navigate, trust and leverage the disorganized data store for organizational benefit.

This risk can be easily mitigated by bringing in proper collaborative data governance, curation and stewardship.

Data Governance: Data governance is usually enforced via people and processes. On the people side, data governance takes the form of a board or committee with a mix of data management professionals (who create enterprise data standards) and business managers (who serve as data owners, stewards and curators with a focus on compliance). These people collaborate to establish and enforce policies that ensure data is compliant, secure, standardized and trusted.

Implementers (technical teams) of the data lake must work with their enterprise governance board so that the lake and its data comply with the established policies.

Data Stewardship: Data is an asset to the lake, and it should be curated by a data steward who is responsible for driving improvements in the data. The best data stewards are business people (non-technical staff) because they can prioritize based on business need and keep data management work aligned with business goals. Priority should be given to metadata, data quality and data lineage.

c) Security:

Cyber attackers are now organized and well equipped with the tools and technology to rapidly extract high value data assets from enterprises.

Such risks and liabilities can be reduced by implementing multiple layers of security.

  • A data lake needs standard protection in the form of authentication and authorization.
  • It is useful to record an audit trail of access by users and tools. Operational metadata can enable such audits.
  • Unlike the user-centric or application-centric security mentioned above, a data-centric security layer operates on or near the data to cleanse, block and de-identify sensitive data (such as personally identifiable information) or high-value data. That way, if the data is stolen, the thief has nothing to sell or commit a crime with.
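
As one possible illustration of data-centric security, the sketch below de-identifies sensitive fields by replacing them with a keyed hash (HMAC-SHA256 from the Python standard library). The field names, the `SECRET_KEY` placeholder and the `pseudonymize` and `de_identify` helpers are assumptions for this example; in practice the key would come from a managed secret store, and the choice of technique (hashing, tokenization, masking or encryption) depends on the policies set by the governance board.

```python
import hashlib
import hmac

# Hypothetical list of fields treated as sensitive.
SENSITIVE_FIELDS = {"email", "phone"}

# Placeholder only: a real key should come from a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value, key=SECRET_KEY):
    """Return a keyed SHA-256 digest of the value as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def de_identify(record, sensitive_fields=SENSITIVE_FIELDS):
    """Return a copy of the record with sensitive fields pseudonymized."""
    cleaned = {}
    for field, value in record.items():
        if field in sensitive_fields and value is not None:
            cleaned[field] = pseudonymize(str(value))
        else:
            cleaned[field] = value
    return cleaned


masked = de_identify({"id": 7, "email": "a@example.com", "city": "Pune"})
print(masked)
```

Because the same input always yields the same digest under a given key, records can still be joined on the pseudonymized field, while the original value is not stored in the de-identified copy.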

d) Availability of Technical resources:

There are very few data management professionals available who have prior experience with data lakes and Hadoop. The people who are available tend to command rather high salaries.

For these reasons, organizations should cross-train existing employees in these skills instead of relying solely on new hires. This strategy works well because it increases the value of employees and makes them more engaged and committed.

As with any emerging technology, it will take time before data lakes reach their full potential. But those who start the journey now, strategically and with a long-term vision, stand to create an enormous edge over their competitors, one that will be difficult to erode in the years to come.

Nitin Srivastava
Nitin is a part of the AIM Writers Programme. He is an experienced data and analytics consultant. For the last two decades, he has been extensively working with large organisations in implementing and managing data warehouses and creating analytic solutions for various domains, predominantly in the BFSI sector.
