
How Twitter Dumped Hadoop To Adopt Google Cloud


Back in May 2018, Twitter announced that it was collaborating with Google Cloud to migrate its services to the cloud. Two years on, the migration is complete, and the company has started reaping the benefits of the move.

For the past 14 years, Twitter has been developing data transformation pipelines to handle the load of its massive user base. Those pipelines initially ran in Twitter’s own data centers; its Hadoop file systems, for example, hosted more than 300 PB of data across tens of thousands of servers.

Despite consistently sustaining massive scale, this system had its limitations. As the user base grew, configuring and extending new features in some parts of the system became challenging, which led to failures.

In the next section, we take a look at how Twitter’s engineering team, in collaboration with Google Cloud, successfully migrated the platform.

How The Migration Took Place

As a first step, Twitter’s team left a few pipelines, such as the legacy Scalding data-aggregation pipelines, unchanged and kept them running in its own data centers. The batch layer’s output, however, was switched to two separate storage locations in Google Cloud.

The output aggregations from the Scalding pipelines were first transcoded from Hadoop sequence files to Avro on-prem, staged in four-hour batches to Cloud Storage, and then loaded into BigQuery, which is Google’s serverless and highly scalable data warehouse, to support ad-hoc and batch queries.
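That load step can be sketched with the BigQuery Python client. The helper below floors a timestamp to its four-hour staging window, and the guarded section loads the matching Avro files from Cloud Storage into BigQuery. The bucket, dataset, and path layout are hypothetical placeholders, not Twitter’s actual scheme:

```python
from datetime import datetime


def batch_window_start(ts):
    """Floor a timestamp to the start of its four-hour staging window."""
    return ts.replace(hour=ts.hour - ts.hour % 4, minute=0, second=0, microsecond=0)


if __name__ == "__main__":
    # Cloud-side load step; requires google-cloud-bigquery and credentials.
    from google.cloud import bigquery

    window = batch_window_start(datetime(2020, 2, 1, 14, 30))
    # Hypothetical bucket and window-keyed directory layout.
    uri = f"gs://example-staging/aggregates/{window:%Y%m%d%H}/*.avro"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
    # Kick off the load job and block until it completes.
    client.load_table_from_uri(
        uri, "example_dataset.aggregates", job_config=job_config
    ).result()
```

Staging in fixed four-hour windows keeps each load job bounded and makes failed batches easy to replay by window key.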

This data is then read from BigQuery by a simple pipeline deployed on Dataflow, which applies some light transformations. Finally, the results from the Dataflow pipeline are written to Cloud Bigtable, a low-latency, fully managed NoSQL database that serves as the backend for online dashboards and consumer APIs.
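A minimal sketch of that serving path, using the Apache Beam Python SDK: the row-key scheme below (metric name plus a reversed timestamp so the newest cells sort first) is a common Bigtable pattern assumed for illustration, and all project, table, and column names are hypothetical:

```python
def make_row_key(metric, epoch_seconds):
    """Bigtable row key: metric name plus reversed timestamp, newest first."""
    return f"{metric}#{2**32 - epoch_seconds:010d}"


if __name__ == "__main__":
    # Requires apache-beam[gcp] and credentials to actually run on Dataflow.
    import apache_beam as beam
    from apache_beam.io.gcp.bigtableio import WriteToBigTable
    from google.cloud.bigtable import row as bt_row

    def to_direct_row(record):
        """Light transformation: shape one BigQuery record into a Bigtable row."""
        direct = bt_row.DirectRow(row_key=make_row_key(record["metric"], record["ts"]))
        direct.set_cell("stats", b"value", str(record["value"]).encode())
        return direct

    with beam.Pipeline() as p:
        (
            p
            | beam.io.ReadFromBigQuery(
                query="SELECT metric, ts, value FROM example_dataset.aggregates",
                use_standard_sql=True,
            )
            | beam.Map(to_direct_row)
            | WriteToBigTable(
                project_id="example-project",
                instance_id="example-instance",
                table_id="serving",
            )
        )
```

Reversing the timestamp in the key lets dashboards fetch the most recent values with a cheap prefix scan instead of reading a metric’s full history.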

With the first iteration running successfully, the team began to redesign the rest of the data analytics pipeline using Google Cloud technologies.

After evaluating all possible options, the team chose Apache Beam because of its deep integration with other Google Cloud products, such as Bigtable, BigQuery, and Pub/Sub, Google Cloud’s fully managed, real-time messaging service.

A BigQuery slot is a unit of computational capacity required to execute SQL queries.
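As a rough illustration with hypothetical numbers: the BigQuery client exposes a finished query job’s total slot-milliseconds, and dividing by the job’s wall-clock duration gives the average number of slots it held:

```python
def average_slots(slot_millis, duration_millis):
    """Average number of slots a query held over its wall-clock runtime."""
    return slot_millis / duration_millis


# Hypothetical job: 600,000 slot-ms consumed over a 30 s (30,000 ms) run
# works out to an average of 20 slots held for the duration of the query.
```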

The Twitter team re-implemented the batch layer as follows: 

  • Data is first staged from on-prem HDFS to Cloud Storage
  • A batch Dataflow job then regularly loads the data from Cloud Storage and processes the aggregations, and 
  • The results are then written to BigQuery for ad-hoc analysis and Bigtable for the serving system.
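The batch layer described in the steps above can be sketched as a single Beam pipeline. The keying function (metric per hour), file paths, and table names are assumptions made for illustration:

```python
def to_kv(record):
    """Key each event by (metric, hour bucket) for per-key aggregation."""
    return ((record["metric"], record["ts"] // 3600), record["count"])


if __name__ == "__main__":
    # Requires apache-beam[gcp] and credentials to actually run.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            # Step 1: read the data staged from on-prem HDFS to Cloud Storage.
            | beam.io.ReadFromAvro("gs://example-staging/events/*.avro")
            # Step 2: process the aggregations.
            | beam.Map(to_kv)
            | beam.CombinePerKey(sum)
            | beam.Map(
                lambda kv: {"metric": kv[0][0], "hour": kv[0][1], "count": kv[1]}
            )
            # Step 3: write results to BigQuery for ad-hoc analysis.
            | beam.io.WriteToBigQuery(
                "example-project:example_dataset.aggregates",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

The same aggregated PCollection could be forked to a second sink for the Bigtable serving path, since Beam pipelines allow multiple writes from one intermediate collection.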

For instance, benchmarks showed that processing 800+ queries (~1 TB of data each) had a median execution time of 30 seconds.

Final architecture after migration (via Google Cloud)

The above picture illustrates the final architecture after the second step of migration. 

For job orchestration, the Twitter team built a custom command-line tool that processes configuration files, calls the Dataflow API, and submits jobs.
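Such a tool could look something like the sketch below, which translates a JSON config into a Dataflow templates launch request. This is a guess at the general shape, not Twitter’s actual tool; the project, region, and config fields are all hypothetical:

```python
def build_launch_request(config):
    """Translate a job config dict into a Dataflow templates.launch request body."""
    return {
        "jobName": config["name"],
        "parameters": config.get("parameters", {}),
        "environment": {"tempLocation": config["temp_location"]},
    }


if __name__ == "__main__":
    # Requires google-api-python-client and credentials to actually submit jobs.
    import argparse
    import json

    from googleapiclient.discovery import build

    parser = argparse.ArgumentParser(description="Submit a Dataflow template job.")
    parser.add_argument("config", help="Path to a JSON job configuration file.")
    args = parser.parse_args()

    with open(args.config) as f:
        config = json.load(f)

    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().templates().launch(
        projectId="example-project",
        location="us-central1",
        gcsPath=config["template_path"],
        body=build_launch_request(config),
    ).execute()
```

Driving submissions from declarative config files keeps job definitions reviewable in version control rather than scattered across ad-hoc launch commands.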

What Do The Numbers Say?


The migration to modernize the advertising data platform started back in 2017, and today, Twitter’s strategy has come to fruition, as can be seen in its annual earnings report.

The revenue for Twitter can be divided mainly into two categories:

  • Ads
  • Data licensing and other services.

According to its Q4 2019 quarterly earnings report, Twitter posted solid revenue growth and steady progress.

“We reached a new milestone in Q4 with quarterly revenue in excess of $1 billion, reflecting steady progress on revenue product and solid performance across most major geographies, with particular strength in US advertising,” said Ned Segal, Twitter’s CFO.

The 2019 revenue was $3.46 billion, which is an increase of 14% year-over-year. 

  • In Q4, advertising revenue totalled $885 million, an increase of 12% year-over-year
  • Total ad engagements increased by 29% year-over-year

The motivation behind Twitter’s migration to GCP also involved other factors, like the democratization of data analysis. For Twitter’s engineering team, enabling data analysis, visualization, and machine learning in a secure way is a top priority, and this is where Google tools such as BigQuery and Data Studio came in handy.

Although Google’s tools were used for simple pipelines, Twitter had to build its own orchestration infrastructure, based on Apache Airflow, for more complex ones. In the area of data governance, BigQuery’s services for authentication, authorization, and auditing did well, but for metadata management and privacy compliance, in-house systems had to be designed.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.