With over 125 million members spread across 191 countries, Netflix faces an uphill task in providing high-quality service to all of them. Every time a member interacts with one of the client applications, which run on over 250 million devices, data points are generated. These interactions result in over 200 billion data points a day.
Netflix’s data infrastructure generates over a trillion events per day and stores over 100 PB of data. This data has to be collected, stored and made ready for further analysis. The infrastructure grows more complex as data volumes increase and ingestion patterns multiply.
The Netflix engineering team has a complex job of building an accurate data lineage system to map out data repositories, dashboards, ad-hoc queries and other such data-artefacts.
The lineage pipeline moves through three main stages: ingestion, conformance and consumption.
Data lineage is the life cycle of data: its origins and the changes it undergoes as it moves across the pipeline. Tracking this is important because it improves the visibility of a pipeline, gives engineers more control and helps them trace errors back to their sources.
The lineage data is supplemented with entity metadata so that it is more useful for specific use cases. For this, the data engineering team at Netflix uses data from Metacat, an internal metadata store and service.
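To make the idea of tracing errors back to their sources concrete, here is a minimal sketch that treats lineage as a directed graph and walks it upstream. The entity names and the in-memory edge list are hypothetical stand-ins for illustration only; they are not Netflix's actual schema or storage.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream entity, downstream entity).
EDGES = [
    ("raw.playback_events", "etl.sessionize"),
    ("raw.device_info", "etl.sessionize"),
    ("etl.sessionize", "warehouse.sessions"),
    ("warehouse.sessions", "dashboard.engagement"),
]


def build_upstream_index(edges):
    """Map each entity to the set of entities that feed it directly."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    return parents


def trace_upstream(entity, parents):
    """Breadth-first walk from an entity back to everything it depends on."""
    seen = set()
    queue = deque([entity])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


parents = build_upstream_index(EDGES)
upstream = trace_upstream("dashboard.engagement", parents)
```

If the hypothetical engagement dashboard shows bad numbers, `upstream` lists every job and table that could have introduced the error.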
This data is accessed through interfaces like SQL and a REST Lineage Service against a graph database. The architecture should be scalable and must enable cross-functional collaboration and ensure data integrity.
At Netflix, the engineers have two approaches to data ingestion: push and pull.
The pull-heavy model, which is more popular currently, operates by scanning the system logs and metadata generated by various engines for data collection.
Spark is one such compute engine: its lineage is derived from Spark plan information. Meson scheduler APIs are used to derive scheduled ETL jobs and runtime metadata.
During the conformance phase, data collected from various sources is checked for consistency, for example in the formats of tables and reports. The components of the consumption phase are a graph database and a REST Lineage Service, along with PRISM for entity risk scoring and data efficiency dashboards.
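As a rough illustration of the pull model, the sketch below scans job log lines and derives lineage edges from what each job read and wrote. The `key=value` log format, the field names and the job names are assumptions made up for this example; real engine logs and Spark plans look different and need engine-specific parsers.

```python
import re

# Hypothetical log lines; the key=value layout is an assumption.
LOG_LINES = [
    "ts=2019-10-01T00:00:00Z job=sessionize "
    "read=raw.playback_events,raw.device_info write=warehouse.sessions",
    "ts=2019-10-01T00:05:00Z job=heartbeat status=ok",
    "ts=2019-10-01T01:00:00Z job=engagement_report "
    "read=warehouse.sessions write=reports.engagement",
]

FIELD = re.compile(r"(\w+)=(\S+)")


def edges_from_log(line):
    """Return (source, job, target) edges for one log line, if any."""
    fields = dict(FIELD.findall(line))
    if "read" not in fields or "write" not in fields:
        return []  # the line carries no lineage information
    job = fields.get("job", "unknown")
    sources = fields["read"].split(",")
    targets = fields["write"].split(",")
    return [(s, job, t) for s in sources for t in targets]


def pull_lineage(lines):
    """Scan all log lines and collect the lineage edges they imply."""
    edges = []
    for line in lines:
        edges.extend(edges_from_log(line))
    return edges


lineage = pull_lineage(LOG_LINES)
```

The jobs themselves do nothing extra here; the collector only reads what they already logged, which is what distinguishes pull from push.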
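A conformance check of this kind might look like the following sketch, which validates lineage records from different sources and normalises entity names into one format. The required fields and the naming rule (lower-case, dot-separated) are assumptions chosen for the example, not Netflix's actual conventions.

```python
# Hypothetical required fields for a lineage record.
REQUIRED = ("source", "target", "origin")


def canonical_name(name):
    """Normalise an entity name to lower-case, dot-separated form."""
    return name.strip().lower().replace("/", ".")


def conform(record):
    """Validate one record and return it in canonical form."""
    missing = [field for field in REQUIRED if not record.get(field)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {
        "source": canonical_name(record["source"]),
        "target": canonical_name(record["target"]),
        "origin": record["origin"],
    }


def conform_all(records):
    """Split records into conformed ones and rejected ones with a reason."""
    conformed, rejected = [], []
    for record in records:
        try:
            conformed.append(conform(record))
        except ValueError as error:
            rejected.append((record, str(error)))
    return conformed, rejected


records = [
    {"source": "Raw/Playback_Events", "target": "warehouse.sessions", "origin": "spark"},
    {"source": "warehouse.sessions", "origin": "s3_access_log"},
]
conformed, rejected = conform_all(records)
```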
Establishing An End-To-End Data Lineage
The data generated by the Netflix platform is diverse and vast. The data ingestion will require multiple layers designed and customised to address several ingestion patterns. This, in turn, adds to the operational complexity.
To address data ingestion challenges and also to improve data accuracy, the engineering team at Netflix deploys AWS S3 access logs to identify entity relationships that might have gone unnoticed by other traditional data ingestion processes.
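One way access logs can reveal such relationships: if the same requester reads objects under one prefix and writes objects under another, the first dataset probably feeds the second. The sketch below works on simplified, pre-parsed records rather than raw S3 server access log lines; the requester names, keys and the two-level prefix rule are hypothetical. The operation values `REST.GET.OBJECT` and `REST.PUT.OBJECT` are the ones S3 server access logs use for object reads and writes.

```python
from collections import defaultdict

# Simplified, pre-parsed access records: (requester, operation, object key).
RECORDS = [
    ("role/etl-sessionize", "REST.GET.OBJECT", "raw/playback/part-0001"),
    ("role/etl-sessionize", "REST.GET.OBJECT", "raw/playback/part-0002"),
    ("role/etl-sessionize", "REST.PUT.OBJECT", "warehouse/sessions/part-0001"),
    ("role/adhoc-analyst", "REST.GET.OBJECT", "warehouse/sessions/part-0001"),
]


def dataset_of(key, depth=2):
    """Treat the first `depth` path segments of a key as the dataset name."""
    return "/".join(key.split("/")[:depth])


def infer_relationships(records):
    """Infer (source, requester, target) edges from reads and writes."""
    reads, writes = defaultdict(set), defaultdict(set)
    for requester, operation, key in records:
        if operation == "REST.GET.OBJECT":
            reads[requester].add(dataset_of(key))
        elif operation == "REST.PUT.OBJECT":
            writes[requester].add(dataset_of(key))

    edges = set()
    for requester, targets in writes.items():
        for target in targets:
            for source in reads.get(requester, ()):
                if source != target:
                    edges.add((source, requester, target))
    return edges


relationships = infer_relationships(RECORDS)
```

The analyst role only reads, so it produces no edge. A real implementation would also need time windows, since a requester that reads and writes days apart is weaker evidence of a dependency.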
Moreover, to make publishing lineage data to pipelines easier, a CRUD layer is being designed.
The end goal at Netflix is to provide universal data lineage that houses all data representations. The team plans to leverage a graph database, a lineage REST service and a GraphQL interface to improve developer productivity.
In short, CRUD is a set of primitive operations (create, read, update, delete) used mostly for databases and static data stores, whereas REST is a high-level API style used for web services and other 'live' systems.
The infrastructure is a complex, data-driven, multi-tenant environment. To maintain efficiency, the engineering team provides every microservice owner with the right set of information.
A decade ago, Netflix changed the way things are done by rewriting the applications that run the entire service to fit a microservices architecture. Each microservice has its own code and resources.
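A CRUD layer for lineage data could be as small as the sketch below, which exposes the four primitive operations over an in-memory dictionary of edges. The class name, method signatures and attributes are hypothetical; the article does not describe the actual design of Netflix's layer.

```python
class LineageStore:
    """Minimal in-memory CRUD layer over lineage edges (illustrative only)."""

    def __init__(self):
        # Maps (source, target) to a dict of edge attributes.
        self._edges = {}

    def create(self, source, target, **attrs):
        """Add a new edge; refuse to overwrite an existing one."""
        key = (source, target)
        if key in self._edges:
            raise KeyError(f"edge already exists: {key}")
        self._edges[key] = dict(attrs)
        return key

    def read(self, source, target):
        """Return a copy of the edge's attributes."""
        return dict(self._edges[(source, target)])

    def update(self, source, target, **attrs):
        """Merge new attributes into an existing edge."""
        self._edges[(source, target)].update(attrs)

    def delete(self, source, target):
        """Remove an edge."""
        del self._edges[(source, target)]


store = LineageStore()
store.create("raw.playback_events", "warehouse.sessions", job="sessionize")
store.update("raw.playback_events", "warehouse.sessions", owner="data-eng")
edge = store.read("raw.playback_events", "warehouse.sessions")
```

A REST service would typically sit on top of such a layer, mapping HTTP verbs onto these four methods.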
To improve the quality of devices, the data teams at Netflix push to build better telemetry to reduce the impact of complexities at the infrastructure level on the responsiveness of the on-device Netflix application.
Netflix has successfully improved the reliability of its data infrastructure by establishing end-to-end data lineage across all data artefacts at an extremely granular level. Also, by forecasting job SLAs accurately, Netflix plans to increase company-wide trust in the data and to enhance efficiency with better data retention.