San Francisco-based data warehouse and data technology company Databricks has announced that it set a world record for data warehouse performance. In an official blog post, the company said that Databricks SQL set a new record in 100TB TPC-DS, outperforming the previous record by 2.2 times. 100TB TPC-DS is the gold standard performance benchmark for data warehousing. The result has been formally audited and reviewed by the TPC council.
New Record Created
A team at the Barcelona Supercomputing Center, which routinely runs TPC-DS on popular data warehouses, corroborated the results of the new record. The researchers benchmarked Databricks and Snowflake and found that Databricks was 2.7 times faster and 12 times better in terms of price performance than Snowflake.
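To see how the two reported ratios relate, here is a minimal sketch. It assumes that price performance is simply total cost of the run and that total cost equals runtime multiplied by a flat cost rate; the study may define it differently, so the derived number is illustrative only. The function name is hypothetical.

```python
def implied_cost_rate_ratio(speedup: float, price_perf_gain: float) -> float:
    """Return how many times higher the slower system's cost rate would be.

    Assumption (not from the study): total cost = runtime * cost rate.
    If the slower system takes `speedup` times longer and costs
    `price_perf_gain` times more in total, its cost per unit of time
    is higher by price_perf_gain / speedup.
    """
    return price_perf_gain / speedup


speedup = 2.7           # "2.7 times faster", as reported in the article
price_perf_gain = 12.0  # "12 times better" price performance, as reported

ratio = implied_cost_rate_ratio(speedup, price_perf_gain)
print(f"Implied cost-rate ratio under this assumption: {ratio:.2f}x")
```

Under that simplification, the gap in price performance comes partly from shorter runtime and partly from a lower cost per unit of time.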
Defined by the non-profit Transaction Processing Performance Council (TPC), TPC-DS is a data warehouse benchmark in which DS stands for decision support. It includes 99 queries of varying complexity, ranging from simple aggregations to complex pattern mining. It was introduced in the mid-2000s to reflect the growing complexity of analytics. Since then, almost all vendors have adopted TPC-DS as the de facto standard for data warehouses.
Passing the official benchmark is difficult because it evaluates many parameters. Databricks claims in its blog that several established vendors often tweak official benchmarks to show their systems in a better light. The tweaks include removing certain SQL features, such as rollups, and removing skew by changing the data distribution. “The tweaks also ostensibly explain why most vendors seem to beat all other vendors according to their own benchmarks,” Databricks has claimed.
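For readers unfamiliar with rollups, the following is a minimal sketch of what a SQL GROUP BY ROLLUP computes, emulated in plain Python. The sales rows, column names, and the rollup_sum helper are hypothetical and are not taken from TPC-DS or from Databricks.

```python
from collections import defaultdict


def rollup_sum(rows, keys, value):
    """Emulate GROUP BY ROLLUP(keys...) with SUM(value).

    Returns a dict mapping grouping tuples to sums. None marks a
    rolled-up (subtotal) column, much as NULL does in SQL output.
    """
    totals = defaultdict(int)
    for row in rows:
        # ROLLUP(a, b) yields the groupings (a, b), (a), and ().
        for depth in range(len(keys), -1, -1):
            prefix = tuple(row[k] for k in keys[:depth])
            group = prefix + (None,) * (len(keys) - depth)
            totals[group] += row[value]
    return dict(totals)


# Hypothetical sample data for illustration only.
sales = [
    {"region": "east", "store": "s1", "amount": 10},
    {"region": "east", "store": "s2", "amount": 20},
    {"region": "west", "store": "s3", "amount": 5},
]

result = rollup_sum(sales, ["region", "store"], "amount")
for group, total in sorted(result.items(), key=lambda kv: str(kv[0])):
    print(group, total)
```

A rollup adds subtotal and grand-total rows on top of the detailed groups, so dropping it from a query reduces the work the engine has to do.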
According to the blog, Databricks managed to address the following challenges:
Open vs proprietary data formats: It is argued that data warehouses built on proprietary data formats can evolve more quickly than those that rely on open formats (for example, Databricks, which is based on the Lakehouse, would not change as quickly). Databricks counters that open formats have their own advantages, such as scope for standardization, avoiding vendor lock-in, and allowing tools to be developed independently of any vendor. The company also says that open formats can evolve; a case in point is Parquet, which has gone through several iterations.
Architecture: It is often assumed that Databricks does not employ the MPP (massively parallel processing) architecture that is considered superior for SQL performance. Databricks says that Databricks SQL is based not on Apache Spark but on Photon, an engine built for SIMD hardware that does heavy parallel query processing. Photon can therefore be considered an MPP engine.
Throughput vs latency trade-off: Databricks has built key enabling technologies, including Photon and Delta Lake, which have improved the performance of both large and small queries.
Time: It is traditionally believed that a database system takes at least a decade to mature. Databricks says it managed to do so much faster thanks to several factors: investing in technologies that support SQL workloads and also benefit AI workloads on Databricks, using a SaaS model that accelerates the software development cycle, and allocating capital more effectively.
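The MPP idea raised under Architecture above, in which several nodes each process part of a single query, can be illustrated with a toy sketch. This is not how Photon is implemented: threads stand in for cluster nodes, the data is hypothetical, and the function names are invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(partition):
    """Work done by one simulated node: sum amounts per key for its rows."""
    sums = Counter()
    for key, amount in partition:
        sums[key] += amount
    return sums


def mpp_sum(rows, num_nodes):
    """Compute SUM(amount) GROUP BY key across several simulated nodes."""
    # Split the rows into one partition per node (round-robin).
    partitions = [rows[i::num_nodes] for i in range(num_nodes)]
    # Each simulated node aggregates only its own partition.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    # A coordinator merges the partial results into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical rows of (key, amount) for illustration only.
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]
totals = mpp_sum(rows, num_nodes=2)
print(totals)
```

The point of the pattern is that no single node has to scan all the data; each one handles a slice, and only the small partial results are combined at the end.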
Wrapping up
Notably, Databricks has been steadily advancing its data warehousing capabilities. In November 2020, the company announced its full suite of data warehousing capabilities as Databricks SQL. The company says that the latest performance test has dispelled initial doubts about whether an open architecture based on a Lakehouse can match a classical data warehouse in performance, speed, and cost.
The blog further stated that the company had assembled the best team on the market, which is working to deliver the ‘next performance breakthrough’. The company is also working on a number of improvements in ease of use and governance.