Now Reading
Pinterest Open-Sources Big Data Analytics Tool Querybook: Major Features & Set Up Instructions

Pinterest Open-Sources Big Data Analytics Tool Querybook: Major Features & Set Up Instructions

Ambika Choudhury
  • Querybook’s core focus is to make composing queries, creating analyses, and collaborating with others as simple as possible.

Recently, Pinterest open-sourced its big data analytics tool Querybook that started its life as an intern project in 2017. Querybook auto analyses executed queries to provide data lineage, example queries, frequent user information, and search/auto-completion ranking. In 2018, the big data tool was released internally and soon became the official solution to query big data at Pinterest.

According to its developers, Querybook is built to provide a simple web UI for big data analysis to discover the right data, compose queries, and share findings. 

Behind Querybook

Querybook is a Big Data IDE that allows users to create, discover, and share data analyses, queries, and tables. The tool includes Database, Redis, ElasticSearch and remote storage. The Database is used to store the DataDocs, and MySQL is recommended. Redis is required to send async tasks to workers, maintain multi-server WebSocket connections, and caching live data for collaborative editing. ElasticSearch provides search functionality for database documents such as DataDocs and tables. Lastly, the remote storage stores the query results.

Querybook has three key components-

  • Web Server: The Web Server is used to handle HTTP requests, send/receive Websocket messages, and provide the static assets for the web.
  • Worker: This component is mainly used to execute long-running queries and scheduled DataDocs. It can also be used for auxiliary tasks such as updating ElasticSearch docs or analysing query lineage. 
  • Scheduler: Scheduler reads the task schedule from the database and sends it to the Celery workers.

Features

  • Querybook compose queries with autocompletion and hovering tooltip.
  • The tool uses both scheduling and charting in DataDocs to build dashboards.
  • Querybook has built-in rich-text support for users. 
  • The tool allows live query collaborations. 
  • Users can add additional documentation to the tables.
  • Using this tool, one can get lineage, sample queries, frequent user, search ranking based on past query runs.

How To Use It

According to its developers, there are two main ways to set up the big data analytics tool.

Single-Machine Instant Setup (locally or on a server): 

The single machine method is a quick way to try out Querybook for less than five users. This method uses docker-compose to bring up all the necessary infrastructure, which is why Docker needs to be installed for quick setup.

For installation, open terminal and run the following: 

git clone https://github.com/pinterest/querybook.git

cd querybook

Now run the following: 

make

Multi-Machine Setup

The multi-machine setup is required when someone wants to scale Querybook for thousands of users. The multi-machine set up runs Querybook containers on different machines/pods. This method is more complicated than the single machine instant set up and requires external infrastructure. The set up includes the following steps-

Step-1: Requirements

  • A MySQL/PostgresSQL[^1] database with version >=5.7. It is recommended to have more than 5GB of space.
  • An Elasticsearch server with version 6.6.1.
  • A 2GB Redis instance, Querybook should not use more than 1GB of memory.
  • If OAuth will be used for authentication, remember to get the OAuth client information (secrets, token url, etc).
  • For notifications, you would need either a Slack API Token or an email address and the email server running on port 25 of the web server. 

Step-2: Choose the instances

  • In this step, users will need to deploy three different services for Querybook. The web servers handle the HTTP/WebSocket traffic, the workers handle the async tasks such as running the query, and the scheduler sends scheduled tasks to the workers. Also, it is important to make sure to only have one instance of scheduler running to prevent duplication in scheduled tasks and have at least two workers for rolling restart deployments.

Step-3: Update your environment variables configuration 

Step-4: Start each service

You can start each service by the following commands:

  • Webserver: make web
  • Celery worker: make worker
  • Scheduler: make scheduler

Wrapping Up

To make it generic while preserving some of the Pinterest-specific integrations, the developers decided to have a two-layer organisation through a plugin system and add an Admin UI. The Admin UI allows organisations to configure the query engines, table metadata ingestion, and access permissions from a single friendly interface. The plugin system integrates Querybook with the internal systems at Pinterest by utilising Python’s importlib.

What Do You Think?

Join Our Telegram Group. Be part of an engaging online community. Join Here.

Subscribe to our Newsletter

Get the latest updates and relevant offers by sharing your email.

Copyright Analytics India Magazine Pvt Ltd

Scroll To Top