By April 2020, the subscription-based streaming service Netflix had seen 16 million new sign-ups, driven by nationwide lockdowns in several countries during the coronavirus pandemic. Despite the surge, there has been no significant outage disrupting streaming. In a blog post, Netflix has detailed a priority-based progressive load shedding technique that improves the customer experience by keeping the service uninterrupted. The technique relies on a ‘priority-throttling’ filter in Zuul, Netflix's API gateway, which can shed unnecessary requests whenever there is an issue on the backend.
Priority Progressive Load Shedding Using Zuul
Traffic overload can have several causes, such as clients triggering multiple retry attempts, an under-scaled service, a network issue, or a glitch with the cloud provider. With these causes in mind, Netflix set out to make the platform more reliable by prioritising requests across multiple device types, progressively throttling requests, and validating its assumptions through Chaos Testing.
Zuul prioritises traffic based on how much a user needs it for playback. Netflix considered three factors (throughput, functionality, and criticality) to categorise request traffic into:
- NON_CRITICAL: This traffic, such as logs and background requests, does not affect playback but has high throughput and contributes heavily to the load on the system.
- DEGRADED_EXPERIENCE: Unlike NON_CRITICAL, this traffic affects the user experience, though not the ability to play. It covers features such as stop and pause markers, language selection, and viewing history.
- CRITICAL: This type affects the ability to play. If the request fails, users will see an error message when they hit play.
Using these traffic buckets and the individual characteristics of each request, Zuul computes a priority score from 1 to 100 for every request. When a problem develops on the backend, the Zuul filter throttles the lowest-priority requests first, so the highest-priority requests get preferential treatment and are served. The implementation is analogous to a queue with a dynamic priority threshold.
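The scoring and threshold idea can be illustrated with a minimal Python sketch. The base scores, the `adjustment` field, and the function names are hypothetical and do not come from Netflix's code. The sketch also assumes that a lower score marks a more important request, since the article does not state the direction of the scale.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    # Hypothetical base scores; in this sketch a LOWER score is MORE important.
    CRITICAL = 10
    DEGRADED_EXPERIENCE = 50
    NON_CRITICAL = 90


@dataclass
class Request:
    path: str
    bucket: Bucket
    adjustment: int = 0  # hypothetical per-request tweak, e.g. by device type


def priority_score(request: Request) -> int:
    """Return a score clamped to the 1..100 range described in the article."""
    return max(1, min(100, request.bucket.value + request.adjustment))


def admit(requests, threshold):
    """Serve requests whose score is within the threshold; shed the rest."""
    served, shed = [], []
    for r in requests:
        (served if priority_score(r) <= threshold else shed).append(r)
    return served, shed


if __name__ == "__main__":
    incoming = [
        Request("/play", Bucket.CRITICAL),
        Request("/history", Bucket.DEGRADED_EXPERIENCE),
        Request("/log", Bucket.NON_CRITICAL),
    ]
    served, shed = admit(incoming, threshold=60)
    print([r.path for r in served], [r.path for r in shed])
```

Lowering `threshold` as the backend degrades sheds more traffic, which is what makes the threshold "dynamic" in this reading.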
During the request lifecycle, Zuul can apply load shedding at two points:
- Service throttling: This applies when requests are routed to a specific backend service. Zuul monitors error rates and concurrent requests to detect anomalies. When the threshold for either of these two metrics is crossed, traffic is throttled to reduce load.
- Global throttling: This affects all backend services rather than a single one. Metrics such as CPU utilisation, concurrent requests, and connection count trigger global throttling. If Zuul itself goes down, no traffic reaches the backend services, which results in a complete outage.
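The two checks above can be sketched as follows. This is only an illustration: the limit values, the `Limits` class, and the definition of the overload fraction (how far a metric is past its limit, capped at 1) are assumptions and not Netflix's settings or formula.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    # Illustrative per-service limits, not Netflix's real settings.
    max_error_rate: float = 0.20
    max_concurrent: int = 1000


def overload_fraction(observed: float, limit: float) -> float:
    """How far a metric is past its limit, as a fraction in [0, 1]."""
    if observed <= limit:
        return 0.0
    return min(1.0, (observed - limit) / limit)


def service_overload(error_rate: float, concurrent: int, limits: Limits) -> float:
    """Service-level check: the worse of the two monitored metrics decides."""
    return max(
        overload_fraction(error_rate, limits.max_error_rate),
        overload_fraction(concurrent, limits.max_concurrent),
    )


def global_overload(cpu: float, concurrent: int, connections: int,
                    cpu_limit: float = 0.80,
                    concurrent_limit: int = 5000,
                    connection_limit: int = 20000) -> float:
    """Gateway-level check over CPU, concurrent requests and connections."""
    return max(
        overload_fraction(cpu, cpu_limit),
        overload_fraction(concurrent, concurrent_limit),
        overload_fraction(connections, connection_limit),
    )
```

A result of `0.0` means no throttling is needed. Any positive value can then be translated into a priority threshold.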
When any of the thresholds listed above is exceeded, traffic is progressively dropped, starting with the lowest priority. The level of throttling is governed by a cubic function. As the overload percentage increases, the priority threshold trails it slowly: at 35% overload, for example, the priority threshold is still in the mid-90s. If the overload keeps increasing, however, the threshold hits the steep side of the curve and everything is throttled.
Source: Netflix TechBlog
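The article does not give the exact cubic. One cubic consistent with the figures quoted above (a threshold in the mid-90s at 35% overload, and everything throttled at full overload) is `100 * (1 - x**3)`. The sketch below uses that form purely as an assumption, and treats requests scoring above the threshold as the ones to shed.

```python
def priority_threshold(overload: float) -> float:
    """Map an overload fraction in [0, 1] to a priority threshold in [0, 100].

    Assumed form: 100 * (1 - x^3). This is one cubic that matches the
    article's description, not necessarily the one Netflix uses.
    """
    x = max(0.0, min(1.0, overload))
    return 100.0 * (1.0 - x ** 3)


def should_shed(score: int, overload: float) -> bool:
    """Shed a request whose score lies above the current threshold."""
    return score > priority_threshold(overload)


if __name__ == "__main__":
    for pct in (0, 35, 60, 80, 100):
        print(pct, round(priority_threshold(pct / 100), 1))
```

Under this assumed curve the threshold barely moves while the overload is small, then falls quickly as the overload approaches 100%.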
With progressive priority-based load shedding added to Zuul, it can shed enough traffic to stabilise the services without members even noticing. When it drops traffic, Zuul sends a signal to devices indicating how many retries they may perform and over what time period. This backpressure mechanism lets Zuul stop a retry storm faster.
Source: Netflix TechBlog
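The retry signal described above could be honoured on the device side roughly as follows. This is a sketch under stated assumptions: the `max_retries` and `retry_window_s` fields, the use of status 503 for a throttled request, and the even spacing of retries across the window are all hypothetical, since the article does not describe the actual wire format.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    status: int
    max_retries: Optional[int] = None        # hypothetical hint from the gateway
    retry_window_s: Optional[float] = None   # hypothetical hint from the gateway


def call_with_backpressure(send: Callable[[], Response],
                           default_retries: int = 3,
                           sleep: Callable[[float], None] = time.sleep) -> Response:
    """Retry a throttled call, never exceeding the gateway's retry hint."""
    budget = default_retries
    attempts = 0
    response = send()
    while response.status == 503:  # assumed status code for a shed request
        if response.max_retries is not None:
            # The gateway may only tighten the budget, never loosen it.
            budget = min(budget, response.max_retries)
        if attempts >= budget:
            break
        window = response.retry_window_s or 0.0
        # Spread the allowed retries evenly across the signalled window.
        sleep(window / max(1, budget))
        attempts += 1
        response = send()
    return response
```

Because the client caps its retries at the gateway's hint, an overloaded backend sees fewer repeated requests than it would with a fixed client-side retry policy.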
Wrapping Up
In 2019, Netflix experienced a major outage that left a large percentage of members unable to use the platform for a few hours. After the progressive load shedding mechanism was deployed, the platform faced a similar situation in 2020. This time, thanks to Zuul, the effect was largely contained and did not hamper members’ ability to use the platform at all.
In future, the team plans to extend request priority to other use cases, such as better retry policies between devices and back-ends, tuning request priority using Chaos Testing, and dynamically changing load shedding thresholds, among others.