MITB Banner

Pinterest Open-Sources Big Data Analytics Tool Querybook: Major Features & Set Up Instructions

Querybook’s core focus is to make composing queries, creating analyses, and collaborating with others as simple as possible.
Share

Recently, Pinterest open-sourced its big data analytics tool Querybook that started its life as an intern project in 2017. Querybook auto analyses executed queries to provide data lineage, example queries, frequent user information, and search/auto-completion ranking. In 2018, the big data tool was released internally and soon became the official solution to query big data at Pinterest.

According to its developers, Querybook is built to provide a simple web UI for big data analysis to discover the right data, compose queries, and share findings. 

Behind Querybook

Querybook is a Big Data IDE that allows users to create, discover, and share data analyses, queries, and tables. The tool includes Database, Redis, ElasticSearch and remote storage. The Database is used to store the DataDocs, and MySQL is recommended. Redis is required to send async tasks to workers, maintain multi-server WebSocket connections, and caching live data for collaborative editing. ElasticSearch provides search functionality for database documents such as DataDocs and tables. Lastly, the remote storage stores the query results.

Querybook has three key components-

  • Web Server: The Web Server is used to handle HTTP requests, send/receive Websocket messages, and provide the static assets for the web.
  • Worker: This component is mainly used to execute long-running queries and scheduled DataDocs. It can also be used for auxiliary tasks such as updating ElasticSearch docs or analysing query lineage. 
  • Scheduler: Scheduler reads the task schedule from the database and sends it to the Celery workers.

Features

  • Querybook compose queries with autocompletion and hovering tooltip.
  • The tool uses both scheduling and charting in DataDocs to build dashboards.
  • Querybook has built-in rich-text support for users. 
  • The tool allows live query collaborations. 
  • Users can add additional documentation to the tables.
  • Using this tool, one can get lineage, sample queries, frequent user, search ranking based on past query runs.

How To Use It

According to its developers, there are two main ways to set up the big data analytics tool.

Single-Machine Instant Setup (locally or on a server): 

The single machine method is a quick way to try out Querybook for less than five users. This method uses docker-compose to bring up all the necessary infrastructure, which is why Docker needs to be installed for quick setup.

For installation, open terminal and run the following: 

git clone https://github.com/pinterest/querybook.git

cd querybook

Now run the following: 

make

Multi-Machine Setup

The multi-machine setup is required when someone wants to scale Querybook for thousands of users. The multi-machine set up runs Querybook containers on different machines/pods. This method is more complicated than the single machine instant set up and requires external infrastructure. The set up includes the following steps-

Step-1: Requirements

  • A MySQL/PostgresSQL[^1] database with version >=5.7. It is recommended to have more than 5GB of space.
  • An Elasticsearch server with version 6.6.1.
  • A 2GB Redis instance, Querybook should not use more than 1GB of memory.
  • If OAuth will be used for authentication, remember to get the OAuth client information (secrets, token url, etc).
  • For notifications, you would need either a Slack API Token or an email address and the email server running on port 25 of the web server. 

Step-2: Choose the instances

  • In this step, users will need to deploy three different services for Querybook. The web servers handle the HTTP/WebSocket traffic, the workers handle the async tasks such as running the query, and the scheduler sends scheduled tasks to the workers. Also, it is important to make sure to only have one instance of scheduler running to prevent duplication in scheduled tasks and have at least two workers for rolling restart deployments.

Step-3: Update your environment variables configuration 

Step-4: Start each service

You can start each service by the following commands:

  • Webserver: make web
  • Celery worker: make worker
  • Scheduler: make scheduler

Wrapping Up

To make it generic while preserving some of the Pinterest-specific integrations, the developers decided to have a two-layer organisation through a plugin system and add an Admin UI. The Admin UI allows organisations to configure the query engines, table metadata ingestion, and access permissions from a single friendly interface. The plugin system integrates Querybook with the internal systems at Pinterest by utilising Python’s importlib.

PS: The story was written using a keyboard.
Share
Picture of Ambika Choudhury

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
Related Posts

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

Upcoming Large format Conference

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

AI Courses & Careers

Become a Certified Generative AI Engineer

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA. 

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India