Databricks gives away its data lake product for free, but why?

Delta Lake is an open format storage layer that brings reliability to data lakes.

At the Data + AI Summit 2022, Databricks announced that it would open-source all Delta Lake APIs as part of the Delta Lake 2.0 release. Further, Databricks will contribute to the Linux Foundation all the features and enhancements it has made to Delta Lake, including capabilities that were hitherto available only in Databricks.

Databricks describes itself as the world’s first lakehouse platform in the cloud. Delta Lake is an open format storage layer that brings reliability to data lakes: it provides ACID (atomicity, consistency, isolation and durability) transactions, handles metadata at scale, and unifies streaming and batch data processing. The announcement comes at a time when several competitors have cast aspersions on the ‘open sourceness’ of Delta Lake.
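To make the ACID claim more concrete, here is a minimal sketch of the idea behind a Delta Lake style transaction log: every change to a table is recorded as a numbered commit file, and a reader reconstructs the table by replaying those commits in order. This is a toy model, not Delta Lake's implementation. The function names (commit, snapshot) and the simplified action shape are assumptions made for illustration, and the sketch ignores details such as readers seeing a half-written commit.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Try to write commit `version`; return False if that version is already taken."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" fails if the file already exists, so only one writer
        # can claim a given version number.
        with open(path, "x") as f:
            json.dump(actions, f)
        return True
    except FileExistsError:
        return False


def snapshot(log_dir):
    """Replay all commits in version order and return the set of live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    live.add(action["file"])
                elif action["op"] == "remove":
                    live.discard(action["file"])
    return live


# Hypothetical table history: one insert, one rewrite, one conflicting writer.
log_dir = tempfile.mkdtemp()
first = commit(log_dir, 0, [{"op": "add", "file": "part-0.parquet"}])
second = commit(log_dir, 1, [{"op": "remove", "file": "part-0.parquet"},
                             {"op": "add", "file": "part-1.parquet"}])
conflict = commit(log_dir, 1, [{"op": "add", "file": "part-2.parquet"}])
files = snapshot(log_dir)
```

In this sketch the second attempt to claim version 1 is rejected, so the losing writer would have to re-read the log and retry against a later version. Readers never see the rejected change, because it was never recorded in the log.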

Is Delta Lake truly open source?

In January 2022, James Malone, senior manager of Product Management at Snowflake, took an indirect jab at Delta Lake. “Many data architectures can benefit from a table format, and in my view, #ApacheIceberg is the one to choose – it’s (actually) open, has a vibrant and growing ecosystem, and is designed for interoperability,” he said.

https://www.linkedin.com/embed/feed/update/urn:li:share:6914288063321952257

Databricks initially launched Delta Lake as an open-source project in 2019. However, many of the features added later were proprietary and available only to Databricks’s customers.

Why such a move now?

According to Databricks, the level of support Delta Lake has received from contributors outside Databricks is the driving force behind open-sourcing all of Delta Lake. There are more than 190 contributors across more than 70 organisations, with almost two-thirds of them coming from leading companies like Apple, IBM, Microsoft, Disney, Amazon, and eBay. Over the past few years, Delta Lake has seen a 663% increase in the number of contributors.

Source: Linux Foundation

“From the beginning, Databricks has been committed to open standards and the open-source community. We have created, contributed to, fostered the growth of, and donated some of the most impactful innovations in modern open source technology,” said Ali Ghodsi, co-founder and CEO of Databricks. “Open data lakehouses are quickly becoming the standard for how the most innovative companies handle their data and AI. Delta Lake, MLflow and Spark are all core to this architectural transformation, and we’re proud to do our part in accelerating their innovation and adoption.” 

Meanwhile, some speculate that the feud between Databricks and Snowflake could be behind the open-source move. In November 2021, Databricks published a blog post, based on research from the Barcelona Supercomputing Center, claiming Databricks SQL was 2.7x faster and 12x better in terms of price-performance than a similarly sized Snowflake setup. In response, Snowflake published a blog post claiming “the Snowflake results that it published were not transparent, audited, or reproducible. And, those results are wildly incongruent with our internal benchmarks and our customers’ experiences”.

“We ran the TPC-DS power run in our AWS-US-WEST cloud region. The entire power run consists of running 99 queries against the 100 TB scale TPC-DS database. Out of the box, all the queries execute on a 4XL warehouse in 3,760s, using the best elapsed time of two successive runs. This is more than two times faster than what Databricks has reported as the Snowflake result,” the blog added. 

Later, Databricks published another blog post claiming the improved performance was due to Snowflake’s pre-baked TPC-DS dataset, which was created two days after the results were announced.

Recently, Databricks launched dedicated lakehouses for retail, for financial services, and for healthcare and life sciences to create an industry-specific cloud-backed platform for data management, analytics and advanced AI. Industry-specific lakehouses enable organisations to leverage data easily and accelerate the development of more advanced, data-driven solutions. Shortly after, Snowflake came up with dedicated Data Clouds for healthcare and life sciences and for retail.

The increasing popularity of Apache Iceberg and the entry of other open-source data lakehouse projects have also been cited as major drivers behind the open-sourcing of Delta Lake. Apache Iceberg is a high-performance format for huge analytic tables that brings the reliability and simplicity of SQL tables to big data.

Major organisations like Snowflake, AWS, Adobe Experience Cloud and Dremio have taken a shine to Apache Iceberg. In 2021, AWS announced Athena support and EMR support for Apache Iceberg. In January 2022, Snowflake announced the adoption of Apache Iceberg. In April 2022, Google Cloud announced the preview of BigLake, a new data lake storage engine that supports Delta Lake and Apache Iceberg data table formats. 

On June 30, 2022, Cloudera announced its support for Apache Iceberg.

In February 2022, Onehouse arrived on the market. Onehouse offers a cloud-native, fully managed lakehouse service built on Apache Hudi, aiming to combine a data lake’s scale with a data warehouse’s convenience.

Major announcements at the summit

Databricks announced the release of MLflow 2.0. MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. MLflow 2.0 comes with MLflow Pipelines, which provide data scientists with pre-defined, production-ready templates based on the type of model they’re building. These templates help data scientists bootstrap and accelerate model development without needing intervention from production engineers.

Spark is a large-scale data analytics engine that scales up easily. However, because it lacked remote connectivity, it was difficult to use from modern data applications. To address this, Databricks introduced Spark Connect, a client and server interface for Apache Spark based on the DataFrame API. With Spark Connect, users can access Spark from any device.
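The client and server split can be illustrated with a toy model. In the sketch below, the client only records DataFrame-style operations as a plan and sends that plan as a request, while the server runs it against data the client never holds. Everything here is hypothetical: RemoteFrame, serve, and the JSON plan format are invented for illustration and are not the Spark Connect API, which has its own protocol and client libraries.

```python
import json


class RemoteFrame:
    """Client-side handle: records operations as a plan and holds no data."""

    def __init__(self, plan=None):
        self.plan = plan or []

    def filter(self, column, minimum):
        # Return a new handle so each step leaves earlier plans untouched.
        step = {"op": "filter", "column": column, "min": minimum}
        return RemoteFrame(self.plan + [step])

    def select(self, *columns):
        step = {"op": "select", "columns": list(columns)}
        return RemoteFrame(self.plan + [step])

    def to_request(self):
        # Serialise the plan; this string is all that crosses the wire.
        return json.dumps(self.plan)


def serve(request, rows):
    """Server side: decode the plan and execute it against server-held rows."""
    for step in json.loads(request):
        if step["op"] == "filter":
            rows = [r for r in rows if r[step["column"]] >= step["min"]]
        elif step["op"] == "select":
            rows = [{c: r[c] for c in step["columns"]} for r in rows]
    return rows


# Data lives only on the "server"; the client builds and ships a plan.
server_rows = [
    {"name": "a", "score": 3},
    {"name": "b", "score": 9},
]
request = RemoteFrame().filter("score", 5).select("name").to_request()
result = serve(request, server_rows)
```

Because the client only needs to build and send a plan, it can be a thin library running in an application, notebook, or other lightweight environment, which is the kind of decoupling the announcement describes.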

Zinnia Banerjee

Zinnia loves writing and it is this love that has brought her to the field of tech journalism.