
Databricks Breaks Data Warehousing Performance Record

The company announced on an official blog that Databricks SQL has set a new record on the 100TB TPC-DS benchmark, outperforming the previous best result by 2.2 times.

San Francisco-based data technology company Databricks has announced a world record for data warehouse performance. The company said on an official blog that Databricks SQL has set a new record on the 100TB TPC-DS benchmark, outperforming the previous best result by 2.2 times. 100TB TPC-DS is the gold-standard performance benchmark for data warehousing, and the result has been formally audited and reviewed by the TPC council.

New Record Created

A team at the Barcelona Supercomputing Center corroborated the new record. The group routinely runs TPC-DS on popular data warehouses, and in a separate benchmark of Databricks and Snowflake found the former to be 2.7 times faster than the latter and 12 times better in price-performance.
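The price-performance comparison above boils down to simple arithmetic: total cost to complete the workload, i.e. elapsed time multiplied by the hourly price of the system. The figures in the sketch below are hypothetical, chosen only to reproduce the reported ratios; they are not the audited benchmark's actual numbers.

```python
# Illustrative price-performance arithmetic. All figures below are
# hypothetical -- they are NOT the audited benchmark's actual numbers.

def cost_to_run(elapsed_hours: float, dollars_per_hour: float) -> float:
    """Price-performance proxy: total cost to finish the workload (lower is better)."""
    return elapsed_hours * dollars_per_hour

# Suppose system A finishes the query suite 2.7x faster than system B
# and also rents cheaper hardware per hour.
a_cost = cost_to_run(elapsed_hours=10.0, dollars_per_hour=9.0)    # 90.0
b_cost = cost_to_run(elapsed_hours=27.0, dollars_per_hour=40.0)   # 1080.0

print(b_cost / a_cost)  # 12.0 -> "12x better price-performance"
```

This is why a system can be "only" 2.7 times faster yet 12 times better in price-performance: speed and hourly price compound.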

Defined by the non-profit Transaction Processing Performance Council (TPC), TPC-DS is a data warehouse benchmark in which DS stands for decision support. It comprises 99 queries of varying complexity, ranging from simple aggregations to complex pattern mining. The benchmark was introduced in the mid-2000s to reflect the growing complexity of analytics, and since then almost all vendors have adopted TPC-DS as the de facto standard for data warehouses.
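To give a flavour of the "simple aggregation" end of that spectrum, here is a toy decision-support-style query. The schema and query are illustrative only, not drawn from the benchmark: the real TPC-DS defines 99 standardized queries over a much larger retail schema, with the hardest ones layering window functions and multi-way joins on top.

```python
# A toy decision-support-style query in the spirit of TPC-DS. The schema
# and query are illustrative only -- the real benchmark defines 99
# standardized queries over a much larger retail schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_sales (item TEXT, quantity INTEGER, price REAL);
    INSERT INTO store_sales VALUES
        ('widget', 3, 10.0), ('gadget', 1, 25.0), ('widget', 2, 10.0);
""")

# Revenue per item: a simple aggregation, the easy end of the
# TPC-DS complexity spectrum.
rows = conn.execute("""
    SELECT item, SUM(quantity * price) AS revenue
    FROM store_sales
    GROUP BY item
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('widget', 50.0), ('gadget', 25.0)]
```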

Passing the official benchmark is difficult, as it evaluates a wide range of parameters. In its blog, Databricks claims that several established vendors often tweak derived versions of the benchmark to demonstrate the better performance of their own systems. The tweaks include removing certain SQL features such as rollups and removing skew by changing the data distribution. “The tweaks also ostensibly explain why most vendors seem to beat all other vendors according to their own benchmarks,” Databricks has claimed.

Databricks managed to address the following challenges:

Open vs proprietary data formats: It is argued that data warehouses built on proprietary data formats can evolve faster than systems that rely on open formats, such as the Lakehouse architecture Databricks is based on. Databricks counters that open formats have their own advantages: scope for standardization, freedom from vendor lock-in, and the ability for tools to be developed independently of any vendor. The company also notes that open formats can still evolve; Parquet, for instance, has gone through several iterations.

Architecture: It is often argued that MPP architecture is superior for SQL performance to the Apache Spark-based architecture Databricks employs. Databricks SQL, however, is built on Photon, a new engine designed to exploit SIMD instructions and heavily parallel query processing; in that sense, Photon can itself be considered an MPP engine.

Throughput vs latency trade-off: Databricks built key enabling technologies such as Photon and Delta Lake, which improved the performance of both large and small queries.

Time: It is traditionally believed that a database system takes at least a decade to mature. Databricks managed it much faster thanks to factors such as investing in technologies that support SQL workloads while also benefiting AI workloads on Databricks; a SaaS model that accelerates the software development cycle; and better capital allocation.
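The vectorized execution idea behind Photon, mentioned in the architecture point above, can be sketched in plain Python. This is a conceptual illustration only, not Photon's actual implementation: instead of interpreting a query one row at a time, a vectorized engine applies each operator to a whole batch of column values, producing tight loops that modern CPUs and their SIMD units execute far more efficiently.

```python
# Conceptual sketch of row-at-a-time vs vectorized (columnar) execution.
# Illustration only -- not how Photon is actually implemented.

rows = [("a", 10), ("b", 25), ("a", 7), ("c", 42)]  # toy table

# Row-at-a-time: the operator is re-dispatched for every single row.
def filter_rows(rows, threshold):
    out = []
    for key, value in rows:
        if value > threshold:
            out.append((key, value))
    return out

# Vectorized: each operator runs over a whole column batch in one
# tight loop -- the pattern that SIMD hardware can accelerate.
def filter_columns(keys, values, threshold):
    mask = [v > threshold for v in values]  # one pass over the column
    return ([k for k, m in zip(keys, mask) if m],
            [v for v, m in zip(values, mask) if m])

keys, values = zip(*rows)
print(filter_rows(rows, 9))  # [('a', 10), ('b', 25), ('c', 42)]
print(filter_columns(list(keys), list(values), 9))
# (['a', 'b', 'c'], [10, 25, 42])
```

Pure Python cannot show the actual speedup, but the shape of the second function — simple loops over contiguous columns — is what lets a compiled engine emit SIMD instructions.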

Wrapping up

Notably, Databricks has been steadily advancing its data warehousing capabilities; in November 2020, the company announced its full suite of such capabilities as Databricks SQL. The company says the latest performance test puts to rest initial doubts about whether an open architecture based on a Lakehouse can match the classical data warehouse in performance, speed, and cost.

The blog further stated that the company has assembled the best team on the market to deliver the ‘next performance breakthrough’. The company is also working on a number of improvements in ease of use and governance.

PS: The story was written using a keyboard.

Shraddha Goled

I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.