LinkedIn is the world's largest professional networking and employment-oriented platform. The company has been leveraging AI/ML to optimise processes such as job postings, job recommendations and business insights. LinkedIn sees more than 210 million job applications submitted every month to the 57 million companies listed on the platform.
To ensure business continuity, LinkedIn picked Teradata to meet the growing demands of batch processing, with the Big Data Engineering team building and maintaining the data warehouse's (DWH) data flows and datasets. The warehouse eventually grew to more than 1,400 datasets, forcing the company to build dedicated data pipelines.
In 2020, Apache Kafka was processing 7 trillion incoming messages per day at LinkedIn, making it critical to keep improving the data infrastructure. A core part of the LinkedIn architecture, the open-source stream processing platform powers use cases such as activity tracking, message exchanges and metric gathering. LinkedIn maintains over 100 Kafka clusters with more than 4,000 brokers, which serve more than 100,000 topics and 7 million partitions.
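At this scale, ordering guarantees depend on how events are assigned to partitions: Kafka's default partitioner hashes the message key so that all events with the same key land on the same partition. The Python sketch below illustrates that idea only; the hash function is a stand-in, not Kafka's actual murmur2 partitioner.

```python
import hashlib

def assign_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition deterministically.

    Kafka's default partitioner uses murmur2; MD5 is used here
    purely as an illustrative deterministic hash.
    """
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All activity-tracking events for one member hash to the same
# partition, so they are consumed in order.
p1 = assign_partition(b"member-42", 32)
p2 = assign_partition(b"member-42", 32)
assert p1 == p2
```

Because the assignment depends only on the key, any producer in any cluster routes a given member's events to the same partition without coordination.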
In 2019, LinkedIn released Dagli, an open-source machine learning library for Java. Dagli lets users write bug-resistant, modifiable, maintainable, and trivially deployable model pipelines without incurring technical debt. Taking advantage of multicore CPUs and powerful GPUs, the library can efficiently train real-world ML models.
Dagli works on servers, Hadoop, command-line interfaces, IDEs, and other typical JVM contexts.
Dagli’s features include:
- Single pipeline: The entire model pipeline is defined as a directed acyclic graph, eliminating the need to implement one pipeline for training and a separate one for inference.
- Ease of deployment: The entire pipeline is serialised and deserialised as a single object.
- Batteries included: Plenty of pipeline components are ready to use, including neural networks, logistic regression, gradient boosted decision trees, FastText, cross-validation, cross-training, feature selection, data readers, evaluation, and feature transformations.
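Dagli itself is a Java library, but the single-pipeline idea can be sketched in a few lines of Python: the same object that is fitted at training time is serialised as one unit and later used for inference, so the two code paths can never drift apart. The class and method names below are illustrative assumptions, not Dagli's actual API.

```python
import pickle

class ScaleAndThreshold:
    """Toy two-node pipeline: a centring step feeding a threshold classifier.

    fit() and predict() walk the same graph, and the whole object
    serialises as a single unit (the "single pipeline" property).
    """
    def fit(self, xs, ys):
        self.mean = sum(xs) / len(xs)                  # node 1: centre the data
        centred = [x - self.mean for x in xs]
        # node 2: place the threshold at the mean of the positive examples
        self.threshold = sum(c for c, y in zip(centred, ys) if y) / max(sum(ys), 1)
        return self

    def predict(self, x):
        return (x - self.mean) >= self.threshold       # same graph, inference path

model = ScaleAndThreshold().fit([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
blob = pickle.dumps(model)        # entire pipeline serialised as one object
restored = pickle.loads(blob)
assert restored.predict(9.0) == model.predict(9.0)
```

Deployment then reduces to shipping one serialised blob; there is no separate feature-transformation code to keep in sync on the serving side.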
Key internal developments that keep LinkedIn's data pipelines working efficiently include:
1. Workflow automation: Azkaban, an open-source tool, automates jobs on Hadoop.
2. ETL (Extract Transform Load): Brooklin, an open-source distributed system, enables streaming data between various source and destination systems and provides change-capture for Espresso and Oracle. LinkedIn also uses Gobblin, a data integration framework, to ingest internal and external data sources into Hadoop; it replaced platform-specific ingestion frameworks to allow more operability and extensibility.
Celia Kung, data pipeline manager at LinkedIn, talked about Brooklin at the 2019 QCon. Brooklin is a streaming data pipeline service that propagates data from various source systems to different destination systems. It can run several thousand streams at the same time within the same cluster, and each of these data streams can be individually configured and dynamically provisioned. Brooklin allows users to configure pipelines, enforce policies and manage data in one centralised location.
“Every time you want to bring up a pipeline that moves data from A to B, you simply need to just make a call to our Brooklin service, and we will provide that dynamically, without needing to modify a bunch of config files, and then deploy to the cluster,” Kung explained.
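Conceptually, a Brooklin-style datastream decouples a source's change feed from its consumers: events captured at the source are propagated, in order, to every configured destination, and new destinations can be attached dynamically. The Python sketch below is a toy illustration of that pattern, not Brooklin's API; the stream and destination names are invented.

```python
class Datastream:
    """Toy change-capture stream: one ordered source log, many destinations."""
    def __init__(self, source_name):
        self.source_name = source_name
        self.log = []               # ordered change events from the source
        self.destinations = {}      # destination name -> delivered events

    def capture(self, event):
        self.log.append(event)      # e.g. a row change from Espresso/Oracle

    def add_destination(self, name):
        self.destinations[name] = []    # provisioned dynamically, no redeploy

    def propagate(self):
        for delivered in self.destinations.values():
            # deliver only events each destination has not yet seen, in order
            delivered.extend(self.log[len(delivered):])

stream = Datastream("profiles-db")
stream.add_destination("hadoop")
stream.add_destination("search-index")
stream.capture({"member": 42, "field": "title", "value": "Engineer"})
stream.propagate()
assert stream.destinations["hadoop"] == stream.destinations["search-index"]
```

The point of the sketch is the quote above: adding a pipeline from A to B is a call against the service, not a config change plus a cluster deploy.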
3. ETL/Ad-hoc querying languages
- Spark SQL/Scala
- Hive SQL
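As a stand-in for the kind of ad-hoc aggregation typically written in Spark SQL or Hive SQL, the snippet below runs an equivalent GROUP BY query against an in-memory SQLite table; the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_applications (company TEXT, member_id INTEGER)")
conn.executemany(
    "INSERT INTO job_applications VALUES (?, ?)",
    [("acme", 1), ("acme", 2), ("globex", 3)],
)

# A typical ad-hoc metric query: applications per company.
rows = conn.execute(
    "SELECT company, COUNT(*) AS applications "
    "FROM job_applications GROUP BY company ORDER BY company"
).fetchall()
```

The same SQL would run largely unchanged in Spark SQL or Hive, which is why a common query language makes ad-hoc exploration portable across engines.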
4. Metric framework: LinkedIn uses the Unified Metrics/Dimension Platforms (UMP/UDP), a common platform to develop, maintain, and manage metric datasets. It enhances governance by allowing any engineer to build their metric dataset use case within the centralised platform.
5. Reporting: Retina is LinkedIn's internally developed reporting platform for data visualisation needs, supporting complex use cases such as DED, the daily executive dashboard.
The team migrated DED datasets and flows to newer, open technologies such as Hadoop, Azkaban, and UMP/UDP, enabling decentralised dataset development within LinkedIn. This removed the single-team bottleneck in metric dataset development and consolidated the bloated DWH into fewer source-of-truth datasets. As a result, the team serves DED metrics using only one-third as many datasets, leading to greater leverage, data consistency and better governance.
The team enabled an active-active set-up on the two production Hadoop clusters and automated the process by sending the DED report from whichever cluster completed the entire report generation first. In addition, through cross-cluster rollouts and a canary for testing changes, the team ensured the stability of DED amid data flow changes.
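The active-active pattern can be sketched as a race between redundant runs: both clusters compute the full report, and whichever finishes first is the one that ships. A minimal Python sketch follows; the cluster names, timings, and report identifiers are made up.

```python
def first_completed(runs):
    """Pick the report from whichever redundant run finished first.

    `runs` maps cluster name -> (completion_time_minutes, report).
    Both clusters compute the full report; only the earliest is sent,
    so a slow or failed cluster does not delay delivery.
    """
    winner = min(runs, key=lambda name: runs[name][0])
    return winner, runs[winner][1]

runs = {
    "cluster-a": (47, "ded-report-2024-01-01"),
    "cluster-b": (52, "ded-report-2024-01-01"),
}
winner, report = first_completed(runs)
assert winner == "cluster-a"
```

The redundancy is the point: either cluster alone can produce the complete report, so a single-cluster outage degrades nothing but headroom.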
The team uses various strategies, based on factors such as the underlying technology and infrastructure, to speed up long-running flows. For instance, a long-running data flow written in Pig was migrated to Spark for a runtime improvement, and the storage format of upstream data was converted from Avro to a columnar format for a 2.4X runtime improvement.
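The columnar win comes from reading only the columns a flow actually touches, rather than decoding whole records as row-oriented formats like Avro require. A minimal Python sketch of the two layouts; the data and column names are invented for illustration.

```python
# Row-oriented layout (Avro-like): every record carries all fields,
# so scanning one field still touches whole records.
rows = [
    {"member_id": 1, "company": "acme", "applications": 3},
    {"member_id": 2, "company": "globex", "applications": 5},
]

# Column-oriented layout: each field is stored contiguously, so an
# aggregation over one column reads only that column.
columns = {
    "member_id": [1, 2],
    "company": ["acme", "globex"],
    "applications": [3, 5],
}

total_row_oriented = sum(r["applications"] for r in rows)
total_columnar = sum(columns["applications"])
assert total_row_oriented == total_columnar == 8
```

Both layouts yield the same answer; the columnar one simply skips the bytes of every column the query never asks for, which is where runtime improvements of the reported magnitude come from.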
Lastly, to smooth out fluctuations in the Hadoop clusters and prevent nodes from bottlenecking on system resources, the team added monitoring tools. These enable the platform and the engineers to isolate system issues from performance issues.
“The scope of this is far beyond just delivering a dashboard in our executives’ inboxes by 5:00 a.m., but rather codifying and pushing for best practices in ETL development and maintenance,” according to Jennifer Zheng. “By setting an example, other critical data pipelines have modelled after DED in high availability and disaster recovery through active-active and redundant set-ups.”