
The Tech Behind Netflix’s Progressive Load Shedding

By April 2020, subscription-based streaming service Netflix had seen 16 million new sign-ups, driven by nationwide lockdowns in several countries amid the coronavirus pandemic. Despite the surge, there was no significant outage that disrupted streaming. In a blog post, Netflix has detailed a priority-based progressive load shedding technique that improves customer experience by keeping the service uninterrupted. It adds a ‘priority-throttling’ filter to Zuul, Netflix’s API gateway, which can shed unnecessary requests whenever there is an issue on the backend.

Priority Progressive Load Shedding Using Zuul

Traffic overload can happen for several reasons, such as clients triggering multiple retry attempts, an under-scaled service, a network issue, or a glitch with the cloud provider. With these causes in mind, Netflix set out to make the platform more reliable by prioritising requests across multiple device types, progressively throttling requests, and validating its assumptions with Chaos Testing.

Zuul prioritises traffic based on how much a user needs it for playback. Netflix focused on three factors — throughput, functionality, and criticality — to categorise request traffic into:

  • NON_CRITICAL: Traffic such as logs and background requests, which does not affect playback but has high throughput and contributes heavily to the load on the system.
  • DEGRADED_EXPERIENCE: Unlike NON_CRITICAL, this type of traffic affects the user experience, but not the ability to play. It covers features such as stop and pause markers, language selection, and viewing history.
  • CRITICAL: This type affects the ability to play. If such a request fails, users see an error message when they hit play.

Based on the traffic buckets discussed above and the individual characteristics of each request, Zuul computes a priority score between 1 and 100 for each request. When a problem develops on the backend, the Zuul filter throttles the lowest-priority traffic first, so the highest-priority requests get preferential treatment and are still served. The implementation is analogous to a queue with a dynamic priority threshold.
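The idea of a dynamic priority threshold can be illustrated with a minimal Python sketch. It is not Netflix's implementation: the base scores, the `adjustment` field, and the convention that a lower score means a more important request are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Base scores are made-up values. In this sketch a LOWER score
    # means a MORE important request (an assumption, not a Zuul fact).
    CRITICAL = 1
    DEGRADED_EXPERIENCE = 40
    NON_CRITICAL = 80


@dataclass(frozen=True)
class Request:
    name: str
    category: Category
    adjustment: int = 0  # hypothetical per-request tweak


def priority_score(request: Request) -> int:
    """Return a score clamped to the range 1..100."""
    return max(1, min(100, request.category.value + request.adjustment))


def admit(requests, threshold):
    """Keep only requests whose score does not exceed the threshold."""
    return [r for r in requests if priority_score(r) <= threshold]


requests = [
    Request("play", Category.CRITICAL),
    Request("language-selection", Category.DEGRADED_EXPERIENCE, adjustment=5),
    Request("log-upload", Category.NON_CRITICAL, adjustment=15),
]

healthy = admit(requests, threshold=100)   # nothing is shed
stressed = admit(requests, threshold=50)   # the log upload is shed first
```

Lowering the threshold as the backend degrades sheds the least important traffic first, while the playback request keeps getting through until the threshold drops below its score.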

During the request lifecycle, Zuul can apply load shedding at two points:

  • Service throttling: This applies when requests are routed to a specific backend service. Zuul monitors that service’s error rate and concurrent requests to detect anomalies. When the threshold for either of these two metrics is crossed, traffic is throttled to reduce the load.
  • Global throttling: This affects all backend services rather than a single one, and is triggered by metrics such as CPU utilisation, concurrent requests, and connection count. It matters because if Zuul goes down, no traffic reaches the backend services, which results in a complete outage.
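The two checks described above can be sketched as simple threshold comparisons. Every class name, function name, and limit below is hypothetical, chosen only to make the distinction concrete; the article does not give Netflix's actual values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceMetrics:
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    concurrent_requests: int


@dataclass(frozen=True)
class GatewayMetrics:
    cpu_utilisation: float     # 0.0 to 1.0
    concurrent_requests: int
    connection_count: int


# Hypothetical limits, not taken from the article.
SERVICE_LIMITS = ServiceMetrics(error_rate=0.10, concurrent_requests=1_000)
GATEWAY_LIMITS = GatewayMetrics(
    cpu_utilisation=0.80, concurrent_requests=5_000, connection_count=20_000
)


def should_throttle_service(m: ServiceMetrics) -> bool:
    """Per-service check: crossing either metric triggers throttling."""
    return (
        m.error_rate > SERVICE_LIMITS.error_rate
        or m.concurrent_requests > SERVICE_LIMITS.concurrent_requests
    )


def should_throttle_globally(m: GatewayMetrics) -> bool:
    """Gateway-wide check: crossing any metric triggers throttling."""
    return (
        m.cpu_utilisation > GATEWAY_LIMITS.cpu_utilisation
        or m.concurrent_requests > GATEWAY_LIMITS.concurrent_requests
        or m.connection_count > GATEWAY_LIMITS.connection_count
    )


# One struggling backend: the error rate is above its limit.
service_hot = should_throttle_service(ServiceMetrics(0.25, 300))
# The gateway itself is comfortable on all three metrics.
gateway_hot = should_throttle_globally(GatewayMetrics(0.55, 1_200, 8_000))
```

Here only the service check fires, so traffic to that one backend would be throttled while the rest of the gateway carries on normally.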

When any of the thresholds listed above is exceeded, traffic is progressively dropped, starting with the lowest priority. The level of throttling is governed by a cubic function, charted in the original Netflix TechBlog post. As the overload percentage increases, the priority threshold trails it slowly at first. For example, at 35% overload, the priority threshold is still in the mid-90s. If the overload keeps increasing, however, the threshold hits the steep side of the curve and everything is throttled.
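The article does not reproduce the exact function, so the sketch below assumes one simple cubic, threshold = 100 × (1 − overload³), which matches the behaviour described above. The function name and the choice of coefficients are assumptions, as is the reading that requests scoring above the threshold are the ones shed.

```python
def priority_threshold(overload: float) -> float:
    """Map an overload fraction (0.0 to 1.0) to a priority threshold.

    Assumed shape: 100 * (1 - overload**3). The threshold falls slowly
    at low overload and steeply as overload approaches 1.0.
    """
    overload = max(0.0, min(1.0, overload))  # clamp out-of-range inputs
    return 100.0 * (1.0 - overload ** 3)


for pct in (0, 35, 60, 80, 100):
    print(f"{pct:>3}% overload -> threshold {priority_threshold(pct / 100):.1f}")
```

With this assumed curve, 35% overload still leaves the threshold in the mid-90s, while at 80% it has fallen to just under 50 and at 100% it reaches zero.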

Source: Netflix TechBlog

With progressive priority-based load shedding added to Zuul, it can shed enough traffic to stabilise services without members even noticing. When it drops traffic, Zuul sends a signal to the device indicating how many retries it may perform and over what time period. Through this backpressure mechanism, Zuul stops a retry storm faster.
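A client that honours such a signal might look like the following sketch. The `ThrottleSignal` fields and the `call_with_hints` helper are hypothetical, since the article does not describe the actual signal format, and spreading the retries evenly across the window is just one possible policy.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ThrottleSignal:
    max_retries: int        # hypothetical: retries the device may perform
    retry_window_s: float   # hypothetical: period to spread them over


def call_with_hints(
    send: Callable[[], Optional[ThrottleSignal]],
    sleep: Callable[[float], None],
) -> Tuple[bool, int]:
    """Call send() and retry only as far as the latest signal allows.

    send() returns None on success, or a ThrottleSignal when the
    request was shed. Returns (succeeded, total_attempts).
    """
    attempts = 1
    retries_used = 0
    signal = send()
    while signal is not None and retries_used < signal.max_retries:
        # Space the permitted retries evenly across the window.
        sleep(signal.retry_window_s / signal.max_retries)
        retries_used += 1
        attempts += 1
        signal = send()
    return signal is None, attempts


# Simulated gateway: sheds the request twice, then serves it.
responses = iter([ThrottleSignal(2, 10.0), ThrottleSignal(2, 10.0), None])
waits = []  # records requested delays instead of really sleeping
ok, attempts = call_with_hints(lambda: next(responses), waits.append)
```

Because the gateway caps the retry budget, a device that keeps getting shed gives up after a bounded number of attempts instead of adding to the overload.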

Source: Netflix TechBlog

Wrapping Up

In 2019, Netflix experienced a major outage that left a large percentage of members unable to use the platform for a few hours. After deploying the progressive load shedding mechanism, the platform faced a similar situation in 2020. This time, thanks to Zuul, the impact was largely contained and did not hamper members’ ability to use the platform at all.

In future, the team plans to expand the use of request priority to other areas, such as better retry policies between devices and back-ends, tuning request priority using Chaos Testing, and dynamically changing load shedding thresholds, among others.


Shraddha Goled

I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.