At the Data + AI Summit 2022, Databricks announced that it would open-source all Delta Lake APIs as part of the Delta Lake 2.0 release. Further, Databricks will contribute to the Linux Foundation all the features and enhancements it has made to Delta Lake, including capabilities that were hitherto available only to Databricks customers.
Databricks describes itself as the world's first cloud lakehouse platform. Delta Lake is an open-format storage layer that brings reliability to data lakes: it provides ACID (atomicity, consistency, isolation and durability) transactions and scalable metadata handling, and it unifies streaming and batch data processing. The announcement comes at a time when several competitors have cast aspersions on how open source Delta Lake really is.
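Delta Lake gets its ACID guarantees from an ordered transaction log of commit files rather than from a database engine. The toy sketch below illustrates that idea in plain Python; the class and method names (`ToyDeltaLog`, `commit`, `snapshot`) are illustrative inventions, not Delta Lake's actual API, though the `_delta_log` directory of numbered JSON commits mirrors the real layout.

```python
# Toy sketch of the transaction-log idea behind Delta Lake's ACID guarantees.
# Real Delta Lake records "add"/"remove" file actions as numbered JSON commits
# in a _delta_log directory; this sketch only mimics that shape.
import json
import os
import tempfile


class ToyDeltaLog:
    def __init__(self, table_dir):
        self.log_dir = os.path.join(table_dir, "_delta_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        return sorted(p for p in os.listdir(self.log_dir) if p.endswith(".json"))

    def commit(self, actions):
        """Atomically append a new version: a commit fully appears or not at all."""
        version = len(self._versions())
        path = os.path.join(self.log_dir, f"{version:020d}.json")
        # Write to a temp file first, then rename. Rename is atomic on POSIX,
        # so a concurrent reader never observes a half-written commit.
        fd, tmp = tempfile.mkstemp(dir=self.log_dir, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(actions, f)
        os.rename(tmp, path)
        return version

    def snapshot(self):
        """Replay all commits in order to reconstruct the current table state."""
        state = []
        for name in self._versions():
            with open(os.path.join(self.log_dir, name)) as f:
                state.extend(json.load(f))
        return state
```

Because readers reconstruct the table only from fully committed log entries, a failed write leaves no visible trace, which is the essence of atomicity and isolation in this design.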
Is Delta Lake truly open source?
In January 2022, James Malone, senior manager of Product Management at Snowflake, took an indirect jab at Delta Lake. "Many data architectures can benefit from a table format, and in my view, #ApacheIceberg is the one to choose – it's (actually) open, has a vibrant and growing ecosystem, and is designed for interoperability," he said.
Databricks initially launched Delta Lake as an open-source project in 2019. However, many of its features added later were proprietary and available only to Databricks’s customers.
Why such a move now?
According to Databricks, the level of support Delta Lake has received from contributors outside the company is the driving force behind open-sourcing all of Delta Lake. There are more than 190 contributors across 70-plus organisations, with almost two-thirds of them coming from leading companies such as Apple, IBM, Microsoft, Disney, Amazon, and eBay. Over the past few years, Delta Lake has seen a 663% increase in contributor strength.
Source: Linux Foundation
“From the beginning, Databricks has been committed to open standards and the open-source community. We have created, contributed to, fostered the growth of, and donated some of the most impactful innovations in modern open source technology,” said Ali Ghodsi, co-founder and CEO of Databricks. “Open data lakehouses are quickly becoming the standard for how the most innovative companies handle their data and AI. Delta Lake, MLflow and Spark are all core to this architectural transformation, and we’re proud to do our part in accelerating their innovation and adoption.”
Meanwhile, some speculate the feud between Databricks and Snowflake could be the reason for the open-source move. Last November, Databricks published a blog, based on research from the Barcelona Supercomputing Center, claiming Databricks SQL was 2.7x faster, and 12x better in terms of price-performance, than a similarly sized Snowflake setup. In response, Snowflake published a blog post arguing that "the Snowflake results that it published were not transparent, audited, or reproducible. And, those results are wildly incongruent with our internal benchmarks and our customers' experiences".
“We ran the TPC-DS power run in our AWS-US-WEST cloud region. The entire power run consists of running 99 queries against the 100 TB scale TPC-DS database. Out of the box, all the queries execute on a 4XL warehouse in 3,760s, using the best elapsed time of two successive runs. This is more than two times faster than what Databricks has reported as the Snowflake result,” the blog added.
Later, Databricks published another blog claiming Snowflake's improved performance was due to a pre-baked TPC-DS dataset, which Snowflake created two days after Databricks announced its results.
Recently, Databricks launched dedicated lakehouses for retail, financial services and healthcare and life sciences to create an industry-specific cloud-backed platform for data management, analytics and advanced AI. Industry-specific lakehouses enable organisations to leverage data easily and accelerate the development of more advanced, data-driven solutions. Shortly after, Snowflake came up with dedicated Data Clouds for healthcare and life sciences and retail.
The increasing popularity of Apache Iceberg and the entry of other open-source data lakehouse projects have also been cited as other major drivers behind the open sourcing of Delta Lake. Apache Iceberg is a high-performance format for huge analytic tables that brings the reliability and simplicity of SQL tables to big data.
Major organisations like Snowflake, AWS, Adobe Experience Cloud and Dremio have taken a shine to Apache Iceberg. In 2021, AWS announced Athena and EMR support for Apache Iceberg. In January 2022, Snowflake announced the adoption of Apache Iceberg. In April 2022, Google Cloud announced the preview of BigLake, a new data lake storage engine that supports the Delta Lake and Apache Iceberg table formats.
On June 30, 2022, Cloudera announced its support for Apache Iceberg.
Last February, Onehouse arrived on the market, delivering a new bedrock for data through a cloud-native, fully managed lakehouse service built on Apache Hudi. Onehouse combines a data lake's scale with a data warehouse's convenience.
Major announcements at the summit
Databricks announced the release of MLflow 2.0. MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. MLflow 2.0 introduces MLflow Pipelines, which provide data scientists with pre-defined, production-ready templates based on the type of model they are building. These templates help data scientists bootstrap and accelerate model development without needing intervention from production engineers.
Spark is a large-scale data analytics engine that scales out easily. However, its lack of built-in remote connectivity made it difficult to embed in modern data applications. To address this, Databricks introduced Spark Connect, a client-server interface for Apache Spark based on the DataFrame API. With Spark Connect, applications can connect to a remote Spark cluster from virtually any device.
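The core idea of Spark Connect is that the client does not run Spark at all: it merely records DataFrame operations as an unresolved logical plan and ships that plan to a server for execution. The toy sketch below illustrates that split in plain Python; the names (`RemoteFrame`, `execute_plan`) are invented for illustration, and the real protocol serializes plans as protobuf messages over gRPC rather than JSON.

```python
# Toy sketch of the client/server split behind Spark Connect: the client
# builds a logical plan from DataFrame-style calls; the server decodes and
# executes it against the actual data.
import json


class RemoteFrame:
    """Client side: records operations instead of executing them."""

    def __init__(self, source, ops=None):
        self.source = source
        self.ops = ops or []

    def filter(self, column, value):
        # Each call returns a new frame with one more op appended (lazy, like Spark).
        return RemoteFrame(self.source, self.ops + [["filter", column, value]])

    def select(self, *columns):
        return RemoteFrame(self.source, self.ops + [["select", list(columns)]])

    def to_plan(self):
        # In Spark Connect this would be a protobuf plan sent over gRPC.
        return json.dumps({"source": self.source, "ops": self.ops})


def execute_plan(plan_json, tables):
    """Server side: decode the serialized plan and run it on local data."""
    plan = json.loads(plan_json)
    rows = tables[plan["source"]]
    for op in plan["ops"]:
        if op[0] == "filter":
            _, column, value = op
            rows = [r for r in rows if r[column] == value]
        elif op[0] == "select":
            cols = op[1]
            rows = [{c: r[c] for c in cols} for r in rows]
    return rows
```

Because only a serialized plan crosses the wire, the client can be a thin library in any language, which is what lets Spark Connect reach devices that could never host a JVM-based Spark driver.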