Last updated November 21, 2017

Data Lake or Data Warehouse or Both?

Published on August 3, 2017
by Nav Kesher

If you work anywhere remotely close to data technologies, you have probably heard about “Data Lakes”. A couple of years ago, when we first heard the term, we pictured petabytes of data in one place and then asked: how is a data lake different from a data warehouse? Or isn’t a data lake just data warehouse 2.0?

That’s where we started researching, and we found this definition from the person who coined the term ‘Data Lake’. James Dixon, the founder and CTO of Pentaho, describes a data lake this way: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Analysts and consumers of the traditional data warehouse depended more on developers and technologists to cleanse and transform data, while the new generation of data consumers is more eager to roll up its sleeves and dive deep into the lake to uncover hidden facts. In essence, these new-generation data hackers need a big playground where they can play with data in its native format. With innovations such as readily available compute and storage via the cloud, and with more technically skilled consumers (for example, data scientists), applying the data lake concept helps organizations become more nimble and more data-oriented.

So the big question is: can a data lake or a data warehouse meet an organization’s needs on its own, or do you need both?

Our take is ‘both’, keeping in mind the capabilities of these two platforms as they stand today. As things evolve, the idea, need and implementation of these platforms will change too. Today’s reality is that an appliance-based data warehouse and a data lake are optimized for different purposes, and the goal is to use each one for what it was designed to do.

Here are some things to consider when having these discussions in your organization:

Types of data and consumer needs – “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.” In contrast, a data warehouse stores only data that has been modeled and structured. A data warehouse follows a schema-on-write approach, while a data lake follows a schema-on-read approach.
Storage considerations – Data warehouses typically run on appliances, which are expensive, while a data lake is typically built on Hadoop, which is designed to be installed on low-cost commodity hardware.
Quick and Dirty/Slow and Exact – A data warehouse is, by definition, a highly structured repository. It’s not technically hard to change the structure, but it can be very time-consuming given all the business processes that are tied to it. A data lake, on the other hand, lacks the structure of a data warehouse, which gives developers and data scientists the ability to easily configure and reconfigure their models, queries, and apps on the fly.
Security – Data warehouse technologies have been around for decades, while big data technologies (the underpinnings of a data lake) are relatively new. Thus, the ability to secure data in a data warehouse is much more mature than securing data in a data lake. It should be noted, however, that there’s a significant effort being placed on security right now in the big data industry. It’s not a question of if, but when.
Cohorts of user base – For a long time, the rallying cry has been BI and analytics for everyone! We’ve built the data warehouse and invited “everyone” to come, but have they come? On average, 20-25% of them have. Is it the same cry for the data lake? Will we build the data lake and invite everyone to come? Not if you’re smart. A data lake, at this point in its maturity, is best suited to data scientists.
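
To make the schema-on-write versus schema-on-read distinction from the first consideration concrete, here is a minimal Python sketch. It is an illustration under our own assumptions, not a reference implementation: the sample events, the `sales` table, and the helper names `load_warehouse` and `read_lake` are all hypothetical. An in-memory SQLite database stands in for a warehouse, and a list of raw JSON lines stands in for a lake.

```python
import json
import sqlite3

# Hypothetical raw events, as they might land in a data lake (one JSON document per line).
RAW_EVENTS = [
    '{"user_id": "a1", "amount": 12.5, "channel": "web"}',
    '{"user_id": "b2", "amount": "7", "device": {"os": "ios"}}',
    '{"user_id": "c3", "note": "free text, no amount"}',
]


def load_warehouse(lines):
    """Schema on write: every record must fit the table before it is stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (user_id TEXT NOT NULL, amount REAL NOT NULL)")
    rejected = []
    for line in lines:
        record = json.loads(line)
        try:
            # Shape and type the record up front; anything that does not fit is rejected.
            row = (record["user_id"], float(record["amount"]))
        except (KeyError, TypeError, ValueError):
            rejected.append(record)
            continue
        conn.execute("INSERT INTO sales (user_id, amount) VALUES (?, ?)", row)
    return conn, rejected


def read_lake(lines, fields):
    """Schema on read: raw lines stay untouched; fields are chosen at query time."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


warehouse, rejected = load_warehouse(RAW_EVENTS)
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
channels = [row["channel"] for row in read_lake(RAW_EVENTS, ["user_id", "channel"])]

print("warehouse total:", total)
print("rejected at load time:", len(rejected))
print("channels seen at read time:", channels)
```

In this sketch the warehouse path rejects the third event at load time because it has no amount, whereas the lake path keeps every raw line and only decides which fields matter when a question is asked.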

According to many market research reports, data volumes are exploding: more data has been created in the past two years than in the entire previous history of the human race. This data has value, and we need storage and technology to retain years of it and to run analytics, visualization, and predictions on it. To make this happen, most organizations are leaning towards a hybrid architecture, an evolutionary model with a smooth transition rather than a revolutionary model with disruptions. That will also be the subject of the next installment on this topic: how to connect a data warehouse and a data lake.

Co-author

Shweta Sinha leads the Data Warehouse and DevOps efforts at Premera Blue Cross. She specializes in creating and scaling data science and data engineering platforms and infrastructure.

Nav Kesher

Nav Kesher is the Head of Platform Data Sciences at Facebook. He has vast experience in bringing ideas to life, from early ideation and planning to development and growth. Nav specializes in data science and analytics, has held various analytics roles, and is also recruiting data scientists to work at Facebook. He holds a B.S. in Engineering and an M.B.A. in general management.
