How Airbnb’s Druid Platform Tackles The Big Data Problem

Brian Chesky co-founded peer-to-peer room and home rental company Airbnb with Nathan Blecharczyk and Joe Gebbia in 2008. Now, almost 11 years later, Airbnb has been used by more than 300 million people in 81,000 cities in 191 countries.

At Airbnb, dashboard analytics play a crucial role in real-time tracking and monitoring of various aspects of the business and its systems. As a result, the timeliness of these dashboards is critical to Airbnb's daily operation, and Druid serves as the platform for ingesting and serving large volumes of streaming data while keeping query latency low.

Why Druid

Handling dashboard analytics in real time comes with the following challenges:

  • Latency of data aggregation in the warehouse.
  • Large volumes of data.
  • Integration between systems. For example, Airbnb stores its datasets in Hadoop and uses Kafka and Spark to process data streams.

So, Druid helps Airbnb tackle these challenges and keep its services afloat even during unplanned outages.

A Brief Look At Druid’s Architecture

At Airbnb, two Druid clusters are running in production. Two separate clusters allow dedicated support for different uses, even though a single Druid cluster can handle more data sources than what we need. In total, we have 4 Brokers, 2 Overlords, 2 Coordinators, 8 Middle Managers, and 40 Historical nodes. In addition, our clusters are supported by one MySQL server and one ZooKeeper cluster with 5 nodes. Druid clusters are relatively small and low cost compared to other service clusters like HDFS and Presto.

Outline of Druid via Airbnb engineering blog

For dashboards backed by systems like Hive and Presto, data aggregation at query time takes a long time, and the storage format makes repeated slicing of the data difficult.

The dashboards built on top of Druid are noticeably faster than those built on other systems; compared to Hive and Presto, Druid can be an order of magnitude faster.
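To make the speed difference concrete, the following is a minimal sketch of the kind of native Druid query a dashboard might issue. The datasource name ("bookings") and the column names are hypothetical illustrations, not Airbnb's actual schema:

```python
import json

def build_timeseries_query(datasource, interval, granularity="hour"):
    """Build a Druid native timeseries query as a JSON-ready dict."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": granularity,
        "intervals": [interval],
        "aggregations": [
            # Count of rows and a sum over a numeric column per bucket.
            {"type": "count", "name": "events"},
            {"type": "doubleSum", "name": "revenue", "fieldName": "price"},
        ],
    }

query = build_timeseries_query("bookings", "2019-03-01/2019-03-02")
print(json.dumps(query, indent=2))
```

Because segments store pre-aggregated, columnar data, a query like this only scans the referenced columns for the requested interval, which is why it can return in sub-second time where a full Hive or Presto scan would take far longer.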

With a high inflow of data, any breakdown takes a toll on the systems. To stay resilient to failures, Druid's architecture is cleanly separated into components for ingestion, serving, and overall coordination.

For batch analytics, Druid's API allows easy ingestion of data from Hadoop. Druid's streaming client API, Tranquility, integrates streaming engines such as Samza, or any other JVM-based streaming engine, with Druid. At Airbnb, real-time analytics is implemented through Spark.
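As a hedged sketch of streaming ingestion: Tranquility can also run as a standalone server that accepts events over HTTP and forwards them to Druid's real-time ingestion. The host, port, datasource name, and event fields below are illustrative assumptions, not Airbnb's actual configuration:

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical Tranquility Server endpoint for a "bookings" datasource.
TRANQUILITY_URL = "http://tranquility.example.com:8200/v1/post/bookings"

def make_event(city, price):
    """Build a single event carrying the timestamp column Druid requires."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "city": city,
        "price": price,
    }

def send_event(event):
    """POST one event to Tranquility (requires a running server)."""
    req = urllib.request.Request(
        TRANQUILITY_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

# Build (but do not send) an example event.
event = make_event("Lisbon", 120.0)
print(json.dumps(event))
```

In practice a stream processor such as Spark would batch many such events per request rather than posting them one at a time.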

Apache Superset, an open-source data visualization tool that originated at Airbnb, serves as the interface for users to visualize the results of their analytics queries.

Querying With Druid

Druid allows individual teams and data scientists to easily define how the data their application or service produces should be aggregated and exposed as a Druid data source.
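Teams typically express such a definition as an ingestion spec submitted to Druid. The sketch below shows the general shape of a Hadoop batch ingestion spec; the datasource name, metrics, and input path are hypothetical, and real specs carry additional parser and tuning detail:

```python
import json

# Minimal illustrative Hadoop batch ingestion spec for a hypothetical
# "bookings" datasource. Not Airbnb's actual configuration.
ingestion_spec = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "bookings",
            "granularitySpec": {
                "segmentGranularity": "day",   # one segment set per day
                "queryGranularity": "hour",    # pre-aggregate to hourly rows
                "intervals": ["2019-03-01/2019-03-02"],
            },
            "metricsSpec": [
                {"type": "count", "name": "events"},
                {"type": "doubleSum", "name": "revenue", "fieldName": "price"},
            ],
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {"type": "static", "paths": "/warehouse/bookings/2019-03-01"},
        },
    },
}

print(json.dumps(ingestion_spec, indent=2))
```

The `metricsSpec` is where a team decides how its data is aggregated at ingestion time, which is what makes the resulting datasource fast to query.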

Every data segment needs to be ingested from MapReduce jobs before it is available for queries. The reason Druid queries are much faster than those of other systems is that a cost is imposed at ingestion time instead. This works well in a write-once-read-many-times model, where the framework only needs to ingest new data on a daily basis.

Future Direction

One of the issues we deal with is the growth in the number of segment files produced every day that need to be loaded into the cluster. Segment files are the basic storage unit of Druid data; they contain the pre-aggregated data ready for serving.

Some data needs recomputation, which results in a large number of segment files that must be loaded onto the cluster at once. Ingested segments are loaded by the coordinator centrally and sequentially, in a single thread. As segments pile up, the delay grows, since the coordinator cannot keep up with the pace.

The input volume of data needed to produce a larger segment is so high that the Hadoop job would run too long crunching it, and would often fail for various reasons.

One possible solution Airbnb plans to work on is compacting segments right after ingestion, before they are handed off to the coordinator, increasing segment size while keeping ingestion stable.
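For comparison, Druid also ships a generic compaction task that merges small segments over an interval after they are loaded. The spec below is a hedged sketch with a hypothetical datasource; it is not Airbnb's planned mechanism, which would compact before the coordinator handoff:

```python
import json

# Illustrative Druid compaction task spec: merge all segments of a
# hypothetical "bookings" datasource for one day into fewer, larger ones.
compaction_task = {
    "type": "compact",
    "dataSource": "bookings",
    "interval": "2019-03-01/2019-03-02",
}

print(json.dumps(compaction_task, indent=2))
```

Submitted to the Overlord, such a task rewrites the interval's many small segments as fewer large ones, reducing the number of files the coordinator must track and load.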

That said, Druid is robust and resilient to failures. Most failures are transparent and unnoticeable to users. Even if a role that is a single point of failure (like the Coordinator, the Overlord, or even ZooKeeper) fails, the Druid cluster is still able to provide query service to users.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
