How Twitter Dumped Hadoop To Adopt Google Cloud

Back in May 2018, Twitter announced that they were collaborating with Google Cloud to migrate their services to the cloud. Today, after two years, they have successfully migrated and have also started reaping the benefits of this move.

For the past 14 years, Twitter has been developing its data transformation pipelines to handle the load of its massive user base. The first deployments for those pipelines were initially running in Twitter’s data centers. For example, Twitter’s Hadoop file systems hosted more than 300PB of data across tens of thousands of servers.

This system had some limitations despite having consistently sustained massive scale. With increasing user base, it became challenging to configure and extend new features to some parts of the system, which led to failures.

Subscribe to our Newsletter

Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

In the next section, we take a look at how Twitter’s engineering team, in collaboration with Google Cloud, have successfully migrated their platform.

How The Migration Took Place

For the first step, Twitter’s team left a few pipelines such as the data aggregation legacy Scalding pipelines, unchanged. These pipelines were made to run at their own data centers. But the batch layer’s output was switched to two separate storage locations in Google Cloud.

The output aggregations from the Scalding pipelines were first transcoded from Hadoop sequence files to Avro on-prem, staged in four-hour batches to Cloud Storage, and then loaded into BigQuery, which is Google’s serverless and highly scalable data warehouse, to support ad-hoc and batch queries.

This data from BigQuery is then read by a simple pipeline deployed on Dataflow, and some light transformations are applied. Finally, results from the Dataflow pipeline are written into Bigtable. This is a Cloud Bigtable for low-latency, fully managed NoSQL database that serves as a backend for online dashboards and consumer APIs.

With the successful installation of the first iteration, the team began to redesign the rest of the data analytics pipeline using Google Cloud technologies.

After evaluating all possible options, the team chose Apache Beam because of its deep integration with other Google Cloud products, such as Bigtable, BigQuery, and Pub/Sub, Google Cloud’s fully managed, real-time messaging service.

A BigQuery slot can be defined as a unit of computational capacity required to execute SQL queries.

The Twitter team re-implemented the batch layer as follows: 

  • Data is first staged from on-prem HDFS to Cloud Storage
  • A batch Dataflow job then regularly loads the data from Cloud Storage and processes the aggregations, and 
  • The results are then written to BigQuery for ad-hoc analysis and Bigtable for the serving system.

For instance, the results showed that processing 800+ queries(~1 TB of data each) took a median execution time of 30 seconds.

Migration final picture via GCP

The above picture illustrates the final architecture after the second step of migration. 

For job orchestration, the Twitter team built a custom command line tool that processes the configuration files to call the Dataflow API and submit jobs. 

What Do The Numbers Say

via Twitter 

The migration for modernization of advertising data platforms started back in 2017, and today, Twitter’s strategies have come to fruition, as can be seen in their annual earnings report. 

The revenue for Twitter can be divided mainly into two categories:

  • Ads
  • Data licensing and other services.

According to the quarterly earnings report for the year 2019, Twitter has declared decent profits with steady progress.  

“We reached a new milestone in Q4 with quarterly revenue in excess of $1 billion, reflecting steady progress on revenue product and solid performance across most major geographies, with particular strength in US advertising,” said Ned Segal, Twitter’s CFO.

The 2019 revenue was $3.46 billion, which is an increase of 14% year-over-year. 

  • Advertising revenue totalled $885 million, an increase of 12% year-over-year
  • Total ad engagements increased by 29% year-over-year

The motivation behind Twitter’s migration to GCP also involves other factors like the democratization of data analysis. For Twitter’s engineering team, visualization, and machine learning in a secure way is a top priority, and this is where Google’s tools such as BigQuery and Data Studio came in handy.

Although Google’s tools were used for simple pipelines, Twitter, however, had to build their infrastructure called Airflow. In the area of data governance, BigQuery services for authentication, authorization, and auditing did well but, for metadata management and privacy compliance, in house systems had to be designed.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.

Download our Mobile App

MachineHack | AI Hackathons, Coding & Learning

Host Hackathons & Recruit Great Data Talent!

AIMResearch Pioneering advanced AI market research

With a decade of experience under our belt, we are transforming how businesses use AI & data-driven insights to succeed.

The Gold Standard for Recognizing Excellence in Data Science and Tech Workplaces

With Best Firm Certification, you can effortlessly delve into the minds of your employees, unveil invaluable perspectives, and gain distinguished acclaim for fostering an exceptional company culture.

AIM Leaders Council

World’s Biggest Community Exclusively For Senior Executives In Data Science And Analytics.

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox