Pinterest Open-Sources Big Data Analytics Tool Querybook: Major Features & Set Up Instructions

Querybook’s core focus is to make composing queries, creating analyses, and collaborating with others as simple as possible.

Recently, Pinterest open-sourced its big data analytics tool Querybook that started its life as an intern project in 2017. Querybook auto analyses executed queries to provide data lineage, example queries, frequent user information, and search/auto-completion ranking. In 2018, the big data tool was released internally and soon became the official solution to query big data at Pinterest.

According to its developers, Querybook is built to provide a simple web UI for big data analysis to discover the right data, compose queries, and share findings. 

Behind Querybook

Querybook is a Big Data IDE that allows users to create, discover, and share data analyses, queries, and tables. The tool includes Database, Redis, ElasticSearch and remote storage. The Database is used to store the DataDocs, and MySQL is recommended. Redis is required to send async tasks to workers, maintain multi-server WebSocket connections, and caching live data for collaborative editing. ElasticSearch provides search functionality for database documents such as DataDocs and tables. Lastly, the remote storage stores the query results.


Sign up for your weekly dose of what's up in emerging technology.

Querybook has three key components-

  • Web Server: The Web Server is used to handle HTTP requests, send/receive Websocket messages, and provide the static assets for the web.
  • Worker: This component is mainly used to execute long-running queries and scheduled DataDocs. It can also be used for auxiliary tasks such as updating ElasticSearch docs or analysing query lineage. 
  • Scheduler: Scheduler reads the task schedule from the database and sends it to the Celery workers.


  • Querybook compose queries with autocompletion and hovering tooltip.
  • The tool uses both scheduling and charting in DataDocs to build dashboards.
  • Querybook has built-in rich-text support for users. 
  • The tool allows live query collaborations. 
  • Users can add additional documentation to the tables.
  • Using this tool, one can get lineage, sample queries, frequent user, search ranking based on past query runs.

How To Use It

According to its developers, there are two main ways to set up the big data analytics tool.

Single-Machine Instant Setup (locally or on a server): 

The single machine method is a quick way to try out Querybook for less than five users. This method uses docker-compose to bring up all the necessary infrastructure, which is why Docker needs to be installed for quick setup.

For installation, open terminal and run the following: 

git clone

cd querybook

Now run the following: 


Multi-Machine Setup

The multi-machine setup is required when someone wants to scale Querybook for thousands of users. The multi-machine set up runs Querybook containers on different machines/pods. This method is more complicated than the single machine instant set up and requires external infrastructure. The set up includes the following steps-

Step-1: Requirements

  • A MySQL/PostgresSQL[^1] database with version >=5.7. It is recommended to have more than 5GB of space.
  • An Elasticsearch server with version 6.6.1.
  • A 2GB Redis instance, Querybook should not use more than 1GB of memory.
  • If OAuth will be used for authentication, remember to get the OAuth client information (secrets, token url, etc).
  • For notifications, you would need either a Slack API Token or an email address and the email server running on port 25 of the web server. 

Step-2: Choose the instances

  • In this step, users will need to deploy three different services for Querybook. The web servers handle the HTTP/WebSocket traffic, the workers handle the async tasks such as running the query, and the scheduler sends scheduled tasks to the workers. Also, it is important to make sure to only have one instance of scheduler running to prevent duplication in scheduled tasks and have at least two workers for rolling restart deployments.

Step-3: Update your environment variables configuration 

Step-4: Start each service

You can start each service by the following commands:

  • Webserver: make web
  • Celery worker: make worker
  • Scheduler: make scheduler

Wrapping Up

To make it generic while preserving some of the Pinterest-specific integrations, the developers decided to have a two-layer organisation through a plugin system and add an Admin UI. The Admin UI allows organisations to configure the query engines, table metadata ingestion, and access permissions from a single friendly interface. The plugin system integrates Querybook with the internal systems at Pinterest by utilising Python’s importlib.

More Great AIM Stories

Ambika Choudhury
A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.

Our Upcoming Events

Conference, in-person (Bangalore)
Machine Learning Developers Summit (MLDS) 2023
19-20th Jan, 2023

Conference, in-person (Bangalore)
Rising 2023 | Women in Tech Conference
16-17th Mar, 2023

Conference, in-person (Bangalore)
Data Engineering Summit (DES) 2023
27-28th Apr, 2023

Conference, in-person (Bangalore)
MachineCon 2023
23rd Jun, 2023

Conference, in-person (Bangalore)
Cypher 2023
20-22nd Sep, 2023

3 Ways to Join our Community

Whatsapp group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our newsletter

Get the latest updates from AIM