Inside LinkedIn’s Big Data Pipelines

In 2019, LinkedIn released Dagli, an open-source machine learning library for Java.

LinkedIn is the world’s largest professional networking and employment-oriented platform. The company has been leveraging AI/ML to optimise various processes such as job postings, job recommendations and business insights. LinkedIn sees more than 210 million job applications submitted every month to the 57 million companies listed on the platform.

LinkedIn’s Daily Executive Dashboard (DED) contains critical growth, engagement and bookings metrics. It monitors and reports on important KPIs for business profiles, indicating the health of LinkedIn’s business. In addition, the LinkedIn system visualises more than 40 metrics across the business lines to give company leaders timely business insights on their dashboards.

The process begins with ingesting billions of records from online sources into HDFS. The Hadoop Distributed File System is designed to run on commodity hardware, managing data processing and storage for big data applications by providing high-throughput access to application data. LinkedIn’s records are aggregated across more than 50 offline data flows, making its huge dataset well suited to Hadoop.
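The aggregation pattern described above — merging records from many offline flows into date-partitioned datasets, the way batch files are typically laid out on HDFS — can be sketched as follows. This is a minimal, language-agnostic illustration; the flow names and record fields are invented for the example, not taken from LinkedIn’s actual schemas.

```python
from collections import defaultdict

def aggregate_flows(flows):
    """Merge records from several offline flows, bucketing them into
    daily partitions the way a batch ingest job might lay files out
    under a date-partitioned directory tree on HDFS."""
    partitions = defaultdict(list)
    for flow_name, records in flows.items():
        for record in records:
            # Each record carries an event date used as the partition key.
            partitions[record["date"]].append({**record, "source_flow": flow_name})
    return dict(partitions)

# Hypothetical flows standing in for two of the 50+ offline data flows.
flows = {
    "job_applications": [{"date": "2021-06-01", "member": 1}],
    "page_views": [{"date": "2021-06-01", "member": 2},
                   {"date": "2021-06-02", "member": 3}],
}
daily = aggregate_flows(flows)
```

Downstream batch jobs can then scan a single day’s partition without touching the rest of the dataset, which is what makes this layout a good fit for high-throughput Hadoop processing.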



To ensure business continuity, LinkedIn originally picked Teradata to meet the growing demands of batch processing, with the Big Data Engineering team building and maintaining the data warehouse’s (DWH) data flows and datasets. The warehouse eventually grew to 1,400+ datasets, forcing the company’s hand on how it built its data pipelines.

By 2020, Apache Kafka was processing 7 trillion incoming messages per day at LinkedIn, making it a critical piece of the data infrastructure. A core part of the LinkedIn architecture, the open-source stream processing platform powers use cases like activity tracking, message exchanges and metric gathering. LinkedIn maintains over 100 Kafka clusters with more than 4,000 brokers, which serve more than 100,000 topics and 7 million partitions.
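The relationship between topics and partitions mentioned above is what lets Kafka scale to those volumes: keyed messages are hashed to a partition, so all events for one key land on the same partition and can be consumed in order. The toy model below illustrates that idea only — it is not the Kafka client API, and the real default partitioner uses murmur2, not md5.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition for a keyed message. Kafka's default partitioner
    hashes the key (murmur2 in the real client); md5 is used here purely
    to show that the same key always lands on the same partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

class MiniTopic:
    """Toy in-memory stand-in for a Kafka topic with N partitions."""
    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> int:
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p

topic = MiniTopic(num_partitions=4)
p1 = topic.produce("member-42", "viewed_job")    # activity-tracking events
p2 = topic.produce("member-42", "applied_job")   # same key -> same partition
```

Because ordering is only guaranteed within a partition, keying activity-tracking events by member ID keeps each member’s event stream in order while spreading load across brokers.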

Dagli allows the user to write bug-resistant, modifiable, maintainable, and trivially deployable model pipelines without incurring technical debt. Taking advantage of highly multicore CPUs and, where available, powerful GPUs, the library is effective for training real-world ML models.

Dagli works on servers, Hadoop, command-line interfaces, IDEs, and other typical JVM contexts. 

Dagli’s features include: 

  • Single pipeline: The entire model pipeline is defined as a directed acyclic graph, eliminating the implementation of a pipeline for training and a separate pipeline for inference.
  • Ease of deployment: The entire pipeline is serialised and deserialised as a single object.
  • Batteries included: Plenty of pipeline components are ready to use including neural networks, logistic regression, gradient boosted decision trees, FastText, cross-validation, cross-training, feature selection, data readers, evaluation, and feature transformations.
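Dagli itself is a Java library, but the core idea in the first bullet — defining the whole pipeline as a single directed acyclic graph that serves both training and inference — can be sketched language-agnostically. The snippet below is a concept illustration only, not Dagli’s actual API; the node names and transformations are invented.

```python
class Node:
    """One vertex in a model-pipeline DAG. The same graph object is
    evaluated for training and for inference, so the two can never
    drift apart -- the property the single-pipeline design buys you."""
    def __init__(self, fn, *parents):
        self.fn, self.parents = fn, parents

    def evaluate(self, inputs, cache=None):
        cache = {} if cache is None else cache
        if self not in cache:
            # A parent is either another Node or a named placeholder input.
            args = [p.evaluate(inputs, cache) if isinstance(p, Node) else inputs[p]
                    for p in self.parents]
            cache[self] = self.fn(*args)
        return cache[self]

# Hypothetical two-step pipeline: normalise a feature, then threshold it.
length = "text_length"                               # placeholder input name
normalised = Node(lambda x: x / 100.0, length)
score = Node(lambda x: 1.0 if x > 0.5 else 0.0, normalised)

prediction = score.evaluate({"text_length": 80})
```

Because there is exactly one graph, serialising it (Dagli’s “ease of deployment” bullet) ships the entire pipeline — feature transformations included — as a single object.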

Key internal developments to ensure efficient working of LinkedIn’s data pipelines include: 

  1. Workflow automation: Azkaban, an open-source tool, automates jobs on Hadoop.
  2. ETL (Extract, Transform, Load): Brooklin, an open-source distributed system, enables streaming data between various source and destination systems and change capture for Espresso and Oracle. LinkedIn also uses Gobblin, a data integration framework, to ingest internal and external data sources into Hadoop; it consolidated platform-specific ingestion frameworks to allow more operability and extensibility.

Celia Kung, data pipeline manager at LinkedIn, talked about Brooklin at QCon 2019. Brooklin is a streaming data pipeline service that propagates data from various source systems to different destination systems. It can run several thousand streams at the same time within the same cluster, and each of these data streams can be individually configured and dynamically provisioned. Brooklin allows the user to configure pipelines to enforce policies and manage data in one centralised location.

“Every time you want to bring up a pipeline that moves data from A to B, you simply need to just make a call to our Brooklin service, and we will provide that dynamically, without needing to modify a bunch of config files, and then deploy to the cluster,” Kung explained.
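The dynamic provisioning Kung describes amounts to a management service that owns pipeline definitions, so creating a new stream is an API call rather than a config edit plus redeploy. The sketch below illustrates that shape only; the method names, URI schemes and options are hypothetical, not Brooklin’s real REST API.

```python
class StreamService:
    """Toy stand-in for a Brooklin-style management service: pipelines
    are provisioned through a call instead of editing config files and
    redeploying to the cluster."""
    def __init__(self):
        self.datastreams = {}

    def create_datastream(self, name, source, destination, **options):
        if name in self.datastreams:
            raise ValueError(f"datastream {name!r} already exists")
        self.datastreams[name] = {
            "source": source,
            "destination": destination,
            "options": options,
            "state": "RUNNING",      # provisioned and started dynamically
        }
        return self.datastreams[name]

service = StreamService()
# Hypothetical change-capture stream from an Espresso table into Kafka.
stream = service.create_datastream(
    "espresso-to-kafka",
    source="espresso://profiles",
    destination="kafka://profile-updates",
    mode="change-capture",
)
```

Keeping every stream’s definition in one service is also what enables the centralised policy enforcement mentioned above: there is a single place to inspect, pause or reconfigure any of the thousands of running streams.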

3. ETL/Ad-hoc querying languages

  • Spark SQL/Scala
  • Hive SQL
  • Trino

4. Metric framework: LinkedIn uses the Unified Metrics Platform and Unified Dimension Platform (UMP/UDP), a common platform to develop, maintain and manage metric datasets. It enhances governance by allowing any engineer to build their metric dataset use case within the centralised platform.
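The value of such a platform is that each metric has exactly one shared definition that every team computes from, instead of being re-derived ad hoc in each pipeline. A minimal sketch of that idea, with entirely hypothetical metric names and record fields:

```python
# Hypothetical central metric registry: one owned definition per metric,
# reused by every dashboard instead of being re-implemented per team.
METRIC_DEFINITIONS = {
    "daily_applications": {
        "filter": lambda r: r["event"] == "apply",
        "measure": lambda rows: len(rows),
    },
    "unique_applicants": {
        "filter": lambda r: r["event"] == "apply",
        "measure": lambda rows: len({r["member"] for r in rows}),
    },
}

def compute_metrics(rows):
    """Evaluate every registered metric against one batch of records."""
    results = {}
    for name, spec in METRIC_DEFINITIONS.items():
        matching = [r for r in rows if spec["filter"](r)]
        results[name] = spec["measure"](matching)
    return results

rows = [{"event": "apply", "member": 1},
        {"event": "apply", "member": 1},
        {"event": "view", "member": 2}]
metrics = compute_metrics(rows)
```

Because the definition lives in one place, a change to what counts as an “application” propagates to every consumer at once — which is the governance benefit the platform is credited with.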

5. Reporting: Retina is LinkedIn’s internally developed reporting platform for data visualisation needs and supporting complex use cases, such as DED.

Current architecture of the Daily Executive Dashboard pipeline

The team migrated DED datasets and flows to the new technologies, and open systems such as Hadoop, Azkaban and UMP/UDP enabled decentralised dataset development within LinkedIn. This removed the single-team bottleneck on metric dataset development and consolidated the bloated DWH into fewer source-of-truth datasets. As a result, the team now serves DED metrics using only one-third as many datasets, leading to greater leverage, data consistency and better governance.

The team enabled an active-active set-up across two production Hadoop clusters and automated delivery by sending the DED report from whichever cluster completed the entire report generation first. In addition, through cross-cluster rollouts and a canary for testing changes, the team ensured the stability of DED amidst data-flow changes.
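The “whichever cluster finishes first” pattern is a classic race between redundant workers. The sketch below simulates it with threads standing in for the two Hadoop clusters; the function names and timings are invented for illustration.

```python
import concurrent.futures
import random
import time

def generate_report(cluster: str) -> str:
    """Stand-in for the full DED report build on one Hadoop cluster."""
    time.sleep(random.uniform(0.01, 0.05))   # simulated variable runtime
    return f"report-from-{cluster}"

def first_completed_report(clusters):
    # Kick off the same job on every cluster and ship whichever
    # finishes first -- the active-active delivery pattern.
    with concurrent.futures.ThreadPoolExecutor(len(clusters)) as pool:
        futures = [pool.submit(generate_report, c) for c in clusters]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

report = first_completed_report(["cluster-a", "cluster-b"])
```

Either cluster can be lost entirely and the report still ships, which is the disaster-recovery property the active-active set-up provides.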

The team uses various strategies, based on factors like the underlying technology and infrastructure, to speed up long-running flows. For instance, a long-running data flow written in Pig was migrated to Spark for a runtime improvement, and the storage format of upstream data was converted from Avro to a columnar format for a 2.4x runtime improvement.
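The Avro-to-columnar win comes from the physical layout: row-oriented formats like Avro force a query to read every field of every record, while columnar formats store each column contiguously so a query touches only the columns it needs. The toy comparison below counts “values touched” to show the effect; the field names and sizes are made up, and the 2.4x figure depends on the real data and queries.

```python
rows = [{"member": i, "country": "US", "applications": i % 3}
        for i in range(1000)]

def sum_column_row_oriented(rows, column):
    """Row layout (Avro-like): every record is read in full."""
    touched, total = 0, 0
    for r in rows:
        touched += len(r)        # all fields of the record are deserialised
        total += r[column]
    return total, touched

# Columnar layout: each column stored contiguously.
columns = {k: [r[k] for r in rows] for k in rows[0]}

def sum_column_columnar(columns, column):
    """Columnar layout: only the requested column's values are read."""
    col = columns[column]
    return sum(col), len(col)

total_r, touched_r = sum_column_row_oriented(rows, "applications")
total_c, touched_c = sum_column_columnar(columns, "applications")
```

For this three-field schema the columnar scan touches a third of the values; on wide production schemas where a query needs a handful of hundreds of columns, the I/O saving is far larger.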

Lastly, to account for fluctuations in the Hadoop clusters and prevent nodes from bottlenecking on system resources, the team added monitoring tools. These enable the platform and flow engineers to isolate system issues from performance issues.

“The scope of this is far beyond just delivering a dashboard in our executives’ inboxes by 5:00 a.m., but rather codifying and pushing for best practices in ETL development and maintenance,” according to Jennifer Zheng. “By setting an example, other critical data pipelines have modelled after DED in high availability and disaster recovery through active-active and redundant set-ups.” 

Avi Gopani
Avi Gopani is a technology journalist that seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.
