Behind every blockbuster Netflix binge is an amazingly managed data platform. It is a task to envision all the data and the systems that work to ensure the speed at which the platform updates series and TV shows for you, but Netflix’s engineering blogs give us a glimpse into the world of managing tens of thousands of data.
In this article, we break down the company’s operations.
Device Management Platform
The Netflix team has built a reliable data management tool at scale to ensure two ongoing efforts. First, providing the latest releases deliver the same quality of Netflix experience on different device types. Second, upkeep the quality bar while working with its partners to port the Netflix SDK in their devices. Netflix’s Device Management platform is the infrastructural foundation for the Netflix Test Studio. It comprises a computing environment called Reference Automation Environment (RAE) and its complimenting software on the cloud.
Subscribe to our Newsletter
Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
- Service-level abstraction for controlling devices and their environments
- Collect and aggregate information, state updates for all devices attached to the RAEs in the fleet.
The Device Management Platform has regular device updates event sourced through the control plane to the cloud to ensure that the NTS is updated with information about the devices available for testing. The challenge it does face is the ability to ingest and process these events in a scalable manner.
Netflix utilises Alpakka-Kafka for their streaming processing solutions. It provides advanced control over the streaming process, satisfies the system requirements, including the Netflix Spring integration, and the framework is lightweight with a less terse code.
Kafka has excelled at the three indicators of consumption performance: the message fetch rate, the max consumer lag, and the committed rate.
Fetch Request Metrics
While before Kafka, the number of fetch calls remained unchanged across burst events but was otherwise quite unstable over time. After the deployment, the calls followed a 1:1 correspondence with Kafka topic’s message publication rate. The number of fetch calls also remained stable over time.
Alpakka-Kafka-based processor hugely scaled its Kafka consumption to ensure that the system is not under or over-consuming Kafka messages.
Max Consumer Lag
The Kafka consumer lag metrics showed a significant improvement from the previous lag that floated long-term at around 60,000 records, which delay updating information by a significantly long time, making it easier for the users to notice. The Alpakka-Kafka-based processor has decreased the average max consumer lag over time to zero outside the burst event windows and 20,000 records inside the burst event window.
Kafka consumers can perform manual, or automatic offset commits when it fetches records. With auto commits, messages are acknowledged as ‘received’ as soon as they are brought and irrespective of processing. Alpakka-Kafka-based processor lowered the committed rate from 7 kbytes/sec to 50 bytes/sec.
The Data Explorer
Netflix uses their Data Explorer to give the engineers fast and safe access to their data stored in Cassandra and Dynomite/Redis data stores.
Features of Data Explorer:
- Multi-Cluster Access
The data explorer directs users to a single web portal for all of their data stores to increase user productivity. In a production environment with hundreds of clusters, this tool helps reduce the available data stores to those authorised for access.
- Schema Designer
The schema designer in Cassandra allows the users to drag and drop their way to a new table instead of writing ‘Create Table’ statements that users have found to be an intimidating experience. With schema designer, users can create a new table using any collection data type, then designate Netflix partition key and clustering columns.
- Explore Netflix Data
Explore mode lets users execute point queries against Netflix clusters, export result sets to CSV, or download them as CQL insert statements.
- Query IDE
While explore mode supports efficient point queries, the Query mode goes one step further to provide a powerful CQL IDE, including autocomplete and helpful snippets.
- Dynomite and Redis Features
Along with the C*, the data explorer has facilities for Dynomite and Redis users as well.
- Key Scanning
To ensure the clusters aren’t strained since Redis is an in-memory data store, the data explorer allows it to perform SCAN operations across all nodes in the cluster.
The team has also codified their best practices to support various OSS environments and built several adapter layers into the product to ensure custom implementations can be made. In addition, they enabled OSS by introducing seams where users could provide their performances for discovery, access control, and data store-specific connection settings.
The Data Explorer has an overridable configuration that allows mechanisms to override the defaults and specify Netflix custom values for different production environments. The CLI Setup Tool provides a series of prompts to improve the experience of creating a Netflix configuration file. “The CLI tool is the recommended approach for building Netflix configuration files, and you can re-run the tool at any point to create a new configuration,” the team wrote in a blog post.
Data Mesh and Data Movement in Netflix Studio
Netflix is moving more and more towards creating all original content from their Netflix Studio; after a pitch, the series goes through several phases. This poses the challenge of providing visibility of Studio data across all the stages.
The Data Mesh Platform
Data Mesh is a fully managed, streaming data pipeline to enable Change Data Capture (CDC) use cases. Data Mesh allows users to create, source and construct pipelines. Its drag-and-drop, the self-service user interface, allows users to explore sources and create pipelines without the need to manage and scale complex data streaming infrastructure.
This platform allows for data movement in Netflix Studio through its configuration drive, decreasing the lead time when creating a new pipeline. Its offerings also include end-to-end schema evolution, self-serve UI, and secure data access.
The Data Mesh platform powers the data movement across Netflix Studio applications exposing GraphQL queries via Studio Edge. Change Data Capture (CDC) sources connector reads from studio applications’ database transaction logs and emits the change events. These events are passed on to the Data Mesh processor, which issues GraphQL queries to Studio Edge, which lands the data in Iceberg tables in Netflix Data Warehouse. After this, they assist in ad-hoc or scheduled querying and reporting.
Netflix studio partners rely on data for decision making and collaboration during the production phases, for which the Studio Tech Solutions team has to ensure real-time reports.
The Genesis is a Semantic Data Layer created by the team to map data points in Data Sourced Definitions to generate the trackers’ SQL. Genesis joins, aggregates, formats and filters data based on what is available in the Data Source Definitions. Genesis currently powers 240+ trackers.
The generated queries are used for multiple trackers in Workflow Definitions to create data movement workflows managed through Netflix Big Data Scheduler, powered by Titus. The scheduler executes Netflix queries and moves the results to a data tool – a Google Sheet Tab, Airtable base, or Tableau dashboard.
Now, do you finally comprehend the price of binge-watching? Tons and tons of data management programmes.