In 2020, Netflix announced their Studio API and GraphQL Federation to overcome the complexity of developing the API aggregation layer. Netflix explained that the goal of GraphQL Federation is two-fold: to provide a unified API for consumers and to give backend developers flexibility and service isolation.
Essentially, the GraphQL Federation architecture provides a gateway layer that brings together different federated services into one unified API endpoint. It allows teams to independently build and operate their domain graph services while connecting the domains under a unified GraphQL schema. This is why the Content Engineering team at Netflix has been transitioning several services to a federated GraphQL platform.
Netflix explained the three core entities of the graph: movie, production and talent. Each is independently owned by a separate engineering team:
- Movie: Every title in the data is a Movie object.
- Production: Each Movie is associated with a Studio Production. The production object tracks factors integral to making a Movie, such as the shooting location, vendors, and more.
- Talent: These are the people working on the Movie, like the actors, directors, etc.
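The relationships between the three entities can be sketched as simple data classes. The field names below are illustrative assumptions, not Netflix's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Talent:
    talent_id: str
    name: str
    role: str  # e.g. "actor" or "director"

@dataclass
class Production:
    ptp_id: str
    shooting_location: str
    vendors: List[str] = field(default_factory=list)

@dataclass
class Movie:
    movie_id: str
    title: str
    # Each Movie is associated with a Studio Production.
    production: Optional[Production] = None
    # The people working on the Movie.
    talent: List[Talent] = field(default_factory=list)
```

Because each team owns one entity, the federated gateway is what stitches these shapes together into one queryable graph.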
These entities are made available in the graph. Following this, the teams need the ability to query for a particular entity based on attributes of related entities. For instance, they may want to search for all movies currently being produced with actor Ryan Reynolds. This is the problem of making a federated graph searchable. As a solution, Netflix created Studio Search.
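The search problem can be illustrated with a toy in-memory version (made-up data and field names): filtering primary entities, movies, by attributes of related entities, talent. The difficulty Netflix faced is doing this at scale across a federated graph rather than a single list:

```python
movies = [
    {"movieId": "m1", "status": "IN_PRODUCTION",
     "talent": [{"name": "Ryan Reynolds", "role": "actor"}]},
    {"movieId": "m2", "status": "RELEASED",
     "talent": [{"name": "Ryan Reynolds", "role": "actor"}]},
    {"movieId": "m3", "status": "IN_PRODUCTION",
     "talent": [{"name": "Someone Else", "role": "actor"}]},
]

def movies_with_actor(movies, actor_name):
    """Return ids of in-production movies featuring the given actor."""
    return [m["movieId"] for m in movies
            if m["status"] == "IN_PRODUCTION"
            and any(t["name"] == actor_name and t["role"] == "actor"
                    for t in m["talent"])]

print(movies_with_actor(movies, "Ryan Reynolds"))  # -> ['m1']
```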
Netflix Studio Search
The Studio Search platform takes a portion of the federated graph and makes it searchable. The portion is a subgraph rooted in an entity of interest that would be queried with text input, filtered, ranked, and faceted. This was done with Elasticsearch as the underlying technology.
The team considered three essential factors in building their index pipeline.
- The definition of the subgraph of interest, rooted at the primary search entity
- Events that notify the platform of changes to entities in the subgraph
- Index-specific configuration
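The three inputs can be pictured as a single configuration object. The class and field names below are assumptions for illustration, not Netflix's actual API:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class IndexPipelineConfig:
    # 1. Templated GraphQL query defining the subgraph of interest,
    #    rooted at the primary search entity.
    graphql_query_template: str
    # 2. CDC event streams that signal changes to entities in the subgraph.
    cdc_event_streams: List[str]
    # 3. Index-specific configuration (analysers, field mappings, etc.).
    index_config: Dict[str, str]
```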
“In short, our solution was to build an index for the subgraphs of interest. This index needs to be kept up-to-date with the data exposed by the various services in the federated graph in near-real-time,” said the Netflix engineering team. GraphQL enables this by providing a straightforward way to define the subgraph through a single templated GraphQL query, which pulls all of the data the user is interested in searching over.
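A minimal sketch of such a templated query, using Python's `string.Template` and a made-up Movie subgraph (the field names are assumptions, not Netflix's schema):

```python
from string import Template

# Hypothetical subgraph rooted at a Movie entity; only the entity id
# varies between reindexing runs.
MOVIE_SUBGRAPH_QUERY = Template("""
query {
  movie(id: "$movie_id") {
    movieId
    title
    production { ptpId shootingLocation }
    talent { name role }
  }
}
""")

def build_query(movie_id: str) -> str:
    """Substitute the entity id into the subgraph template."""
    return MOVIE_SUBGRAPH_QUERY.substitute(movie_id=movie_id)
```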
The ‘Change Data Capture’ (CDC) events trigger the reindexing operation for individual entities when they change, ensuring the index stays up to date. The teams leverage Netflix’s CDC connectors. Since all of the data to be indexed can be fetched from the federated graph, the only piece of information an event needs to carry is an entity id; when substituted into the GraphQL query template, this id fetches the entity and its related data. Elasticsearch, the underlying technology, is used to create text analysers that can be generalised across domains, leveraging the type information from the GraphQL query template and the user-specified index configuration.
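A sketch of this reindexing trigger, under the assumption that the CDC event carries only an entity id; the gateway call is a stub callable and all names are illustrative:

```python
from string import Template

# Simplified query template; field names are assumptions.
QUERY_TEMPLATE = Template('query { movie(id: "$id") { movieId title } }')

def handle_cdc_event(event, fetch, index):
    """Substitute the event's entity id into the GraphQL template, fetch
    the subgraph from the federated gateway (here a stub callable), and
    write the resulting document to the index."""
    query = QUERY_TEMPLATE.substitute(id=event["entityId"])
    document = fetch(query)
    index[event["entityId"]] = document
    return document
```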
Finally, a Data Mesh pipeline is created based on these inputs. A more recent concept, data mesh is a decentralised architecture marking an organisational shift in the way enterprises manage big data. It addresses the lack of data ownership and poor data quality, and removes bottlenecks to encourage organisational scaling. Data mesh treats data as a product: each source has a data product owner, who could ideally be part of a cross-functional team of data engineers. At Netflix, the mesh consists of the user-provided CDC event source, a processor to enrich those events using the user-provided GraphQL query, and a sink to Elasticsearch.
Image source: Netflix
Netflix’s Studio applications produce events to schematised Kafka streams within Data Mesh. They do so in two ways: either by transacting with a database that is monitored by a CDC connector, which creates the events, or by creating events directly using a Data Mesh client. These schematised events are consumed by Data Mesh processors implemented in the Apache Flink framework. When an entity has multiple event sources, a union processor combines the data from the multiple Kafka streams. Next, a GraphQL processor executes the user-provided GraphQL query to fetch documents from the federated gateway, which in turn fetches data from the Studio applications. Lastly, the data fetched from the gateway is put onto another schematised Kafka topic and processed by an Elasticsearch sink in Data Mesh, which indexes it into an Elasticsearch index configured with an indexing template tailored to the fields and types present in the document.
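The stages of this pipeline can be imitated in a few lines of plain Python to show the data flow. The real processors run on Flink over Kafka topics; these generator functions are only single-process stand-ins:

```python
def union_processor(*streams):
    """Combine change events from multiple Kafka-like streams into one."""
    for stream in streams:
        yield from stream

def graphql_processor(events, fetch_from_gateway):
    """Enrich each event by fetching its document from the (stubbed)
    federated gateway, keyed on the event's entity id."""
    for event in events:
        yield fetch_from_gateway(event["entityId"])

def elasticsearch_sink(documents, index):
    """Index each enriched document (here, a plain dict stands in for
    the Elasticsearch index)."""
    for doc in documents:
        index[doc["movieId"]] = doc
```

Wiring them together mirrors the article's flow: events from several streams are unioned, enriched via GraphQL, then sunk into the index.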
Movies are always changing, whether through a production or talent detail change. The reverse lookup feature updates the model when a related entity changes: the pipeline consults the index to find all primary entities that could be affected by the change, then triggers reindexing events for them. Observing a change to the Production with ptpId “abc”, “we can query the index for all documents with production.ptpId == “abc” and extract the movieId,” the team explained. The movieId is then passed down into the rest of the indexing pipeline. The team also revealed this feature to be convenient yet not user friendly.
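A minimal sketch of this reverse lookup against an in-memory stand-in for the index (the document shapes are assumptions based on the ptpId example above):

```python
def reverse_lookup(index, ptp_id):
    """Given a change to a Production, return the ids of the primary
    entities (movies) whose indexed documents reference it, so they can
    be reindexed by the rest of the pipeline."""
    return [doc["movieId"] for doc in index.values()
            if doc.get("production", {}).get("ptpId") == ptp_id]
```

In production this query would run against Elasticsearch itself rather than a dict, but the shape of the lookup is the same: match on `production.ptpId`, extract `movieId`.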