Earlier this month, LinkedIn’s tech team announced in a blog post that the professional networking platform has transitioned its analytics stack to open-source big data technologies. The previous analytics stack at LinkedIn spanned more than 1,400 datasets, 2,100-plus users, and 900-plus data flows.
The journey freed LinkedIn from the limitations of third-party proprietary platforms (3PP) and saved the company millions in support, licensing and other operating expenses. “…The move gave us more control over our own destiny via the freedom to extend and enhance our components and platforms,” the authors wrote.
Large-scale tech transitions are usually complex and often plagued by delays. However, LinkedIn claims that its tooling and approach allowed a smooth transition with zero negative production impact. LinkedIn also used the opportunity to re-envision its warehouse strategy.
Today, we dive deep into how LinkedIn’s analytics tech stack has evolved over the years, covering:
- How LinkedIn navigated and executed the large-scale data and user migration
- How it used the tech transition to improve its data ecosystem
- Learnings from the experience
How it started
Founded in 2002, LinkedIn initially leveraged a 3PP data platform to accelerate its growth. Despite its limitations, this route met the company’s technology needs and let it assemble off-the-shelf products quickly. The following figure depicts LinkedIn’s early-stage tech stack.
LinkedIn ran on this tech stack for the following six years but had to rethink it after facing the following challenges:
- The closed nature of the system curbed the freedom to evolve, and the integration of the open-source and internal systems was challenging.
- Additionally, data pipeline development was limited to a small central team because of Appworx and Informatica licensing constraints, making the platform difficult to scale.
These limitations drove the team to develop a new data lake on Hadoop in parallel. With no clear transition process, the team kept maintaining the old data warehouse alongside the new one. Data had to be copied between the two stacks, which doubled the complexity and maintenance burden and confused consumers.
As a result, LinkedIn planned and executed a migration of all the datasets to the new stack.
Dataset lineage and usage
Right at the beginning, the tech team realised that it would be difficult to start such a massive migration without first deriving dataset lineage and usage. This understanding enabled the team to plan the order of the dataset migration, beginning with datasets that had no dependencies and working upwards towards their dependants. It also helped identify zero- or low-usage datasets that could be retired to reduce the workload, and allowed the team to track the percentage of users on the new and old systems, which became the key performance indicator (KPI) for the migration.
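Ordering datasets so that dependency-free ones migrate first is, in effect, a topological sort of the lineage graph. A minimal sketch with Python’s standard library, using illustrative dataset names (not LinkedIn’s actual tables):

```python
from graphlib import TopologicalSorter

# Hypothetical lineage map: each dataset -> the upstream datasets it depends on.
lineage = {
    "dim_member": set(),
    "dim_company": set(),
    "fact_page_views": {"dim_member"},
    "agg_weekly_engagement": {"fact_page_views", "dim_company"},
}

# Datasets with no dependencies come first; aggregates built on top follow.
migration_order = list(TopologicalSorter(lineage).static_order())
print(migration_order)
```

With a real lineage catalogue, the same ordering can be computed over thousands of datasets, and any cycle in the graph is flagged as an error before migration starts.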
There were no conventional solutions that would support LinkedIn’s heterogeneous environment of 3PP, Teradata (TD) and Hadoop. While a project was underway to build dataset lineage as part of DataHub, it could not be completed in the required timeline, which is why LinkedIn built the tooling up front to help plan and execute the migration.
At the beginning of the process, the team created a data product to provide the upstream and downstream dependencies of TD datasets, then built data pipelines to extract usage information. To obtain the TD metadata supporting these efforts, the team parsed the TD logs. For Hadoop metadata, they found a better solution: the team added instrumentation to MapReduce and Trino to emit Kafka events with detailed dataset-access metadata. These events were then ingested into HDFS by a Gobblin job and processed with a Hive script for downstream consumption and dashboards.
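The blog post does not publish the event schema, but a dataset-access event of the kind the instrumented engines might emit to Kafka could look like the following sketch. All field names here are illustrative assumptions:

```python
import json
import time

def make_access_event(dataset, user, engine, operation):
    """Build a dataset-access event (hypothetical schema, not LinkedIn's)."""
    return {
        "dataset": dataset,
        "user": user,
        "engine": engine,        # e.g. "mapreduce" or "trino"
        "operation": operation,  # e.g. "read" or "write"
        "timestamp_ms": int(time.time() * 1000),
    }

# In production this payload would be published to a Kafka topic and later
# landed in HDFS by a Gobblin job; here we just serialise it to JSON.
event = make_access_event("dim_member", "jdoe", "trino", "read")
print(json.dumps(event))
```

Aggregating such events per dataset yields exactly the usage counts needed to spot zero- or low-usage datasets.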
Using these tools, LinkedIn catalogued its datasets to assist planning. The catalogues not only helped plan a major data model revision but also provided a holistic view, surfacing redundant datasets and opportunities to conform dimension tables. This allowed the team to consolidate 1,424 datasets into 450. Additionally, they de-normalised and pre-joined tables to streamline analytics.
Migration to a new system
The new ecosystem’s design was influenced by LinkedIn’s migration away from TD and addressed the earlier pain points:
- The new Hadoop ecosystem allowed data adoption and development by other teams, unlike earlier when only the central team could build TD data pipelines.
- Open-source tech stack allowed easy enhancement and custom-built projects, enabling LinkedIn to develop many innovations to handle data at scale.
- Running Hadoop and TD in parallel meant extra cost and complexity. The new system brought the tech and workflows together, enabling efficient maintenance and enhancement in a single place.
The new analytics tech stack is depicted in the following diagram:
The new stack consists of a Unified Metrics Pipeline (UMP), through which developers provide ETL scripts to create data pipelines; Azkaban, a distributed workflow scheduler that manages jobs on Hadoop; and dataset readers, accessed via the Data Access at LinkedIn (DALI) reader, that serve business analytics through dashboards and ad-hoc queries.
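For a sense of how a scheduler like Azkaban chains pipeline jobs, here is a sketch in Azkaban’s classic `.job` properties format; the job names and commands are made up for illustration:

```properties
# load_dim_member.job -- runs first, no dependencies
type=command
command=spark-submit load_dim_member.py

# agg_weekly_engagement.job -- scheduled only after its upstream job succeeds
type=command
command=spark-submit agg_weekly_engagement.py
dependencies=load_dim_member
```

The `dependencies` property is what lets the scheduler execute a whole DAG of ETL jobs in the correct order.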
Additionally, to improve data pipeline performance, the team migrated from the Avro file format to ORC. This boosted read speeds by 10 to 1,000x, along with a 25 to 50 per cent improvement in compression ratio. They also migrated Hive and Pig flows to Spark, cutting runtimes by close to 80 per cent.
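The read-speed gain comes largely from ORC being columnar: an analytical query touches only the columns it needs instead of deserialising whole records, as a row-oriented format like Avro requires. A toy illustration of the difference (pure Python, not the actual file formats):

```python
# Row-oriented layout (Avro-like): one record per row.
rows = [
    {"member_id": i, "country": "US", "page_views": i % 7}
    for i in range(1000)
]

# Scanning rows touches every record even though we need a single field.
total_row = sum(r["page_views"] for r in rows)

# Column-oriented layout (ORC-like): one array per column.
columns = {
    "member_id": [r["member_id"] for r in rows],
    "country": [r["country"] for r in rows],
    "page_views": [r["page_views"] for r in rows],
}

# The same query reads just the one column it needs.
total_col = sum(columns["page_views"])
print(total_row, total_col)
```

In a real engine, the columnar layout also enables per-column compression and predicate pushdown, which is where the large compression and speed gains come from.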
However, that was not the end. After the dataset migration, the team still had to migrate the 2,100-plus TD users and deprecate the 1,400-plus old datasets. Manual migration was tedious, expensive and prone to human error, whereas fully automating the process would have required building a new service altogether. The team therefore combined the best of both, coordinating the manual work with the help of automation.
The backend used a MySQL operational datastore. On top of it, the team built an API service to coordinate the deprecation: it identified the right candidate users, emailed them about the upcoming deprecation, and prompted them to save or delete their datasets from TD. This made the process less tedious and less expensive.
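The core of such a coordination service is selecting deprecation candidates from usage data and drafting the owner notifications. A minimal sketch, with a hypothetical usage record and threshold (none of these names come from LinkedIn’s post):

```python
from dataclasses import dataclass

@dataclass
class DatasetUsage:
    name: str
    owner_email: str
    reads_last_90_days: int

def deprecation_notices(usage_rows, read_threshold=0):
    """Pick zero/low-usage datasets and draft one notification per owner.
    Threshold and message wording are illustrative assumptions."""
    notices = []
    for row in usage_rows:
        if row.reads_last_90_days <= read_threshold:
            notices.append({
                "to": row.owner_email,
                "subject": f"[Deprecation] {row.name} scheduled for removal",
                "body": (
                    f"Dataset {row.name} had {row.reads_last_90_days} reads "
                    "in the last 90 days. Please save anything you need "
                    "before it is deleted from TD."
                ),
            })
    return notices

rows = [
    DatasetUsage("dim_member", "a@example.com", 5400),
    DatasetUsage("tmp_campaign_2013", "b@example.com", 0),
]
print(deprecation_notices(rows))
```

In production, the usage rows would come from the MySQL datastore and the notices would feed an email service, with the API tracking each dataset’s deprecation state.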
The image below depicts user migration using dataset deprecation tools.
From their experience, the team at LinkedIn concluded that tech migrations require enormous effort. First, companies planning a tech migration should look out for opportunities to improve their ecosystem. Second, they should invest in closing performance and feature gaps early in the process to enable a smoother transition. And finally, they should build automated solutions wherever possible.
Going forward, LinkedIn plans to take on even larger tech transformations. At present, the team is moving its analytics tech stack to Microsoft Azure while continuing to build on these learnings.