AWS faced a string of outages in December 2021: three in a span of two weeks, to be exact. In the third outage, the cloud computing giant’s East Coast data centre in Northern Virginia went down. The disruption lasted for hours, affecting the central and eastern United States, and threw Slack, Epic Games Store and Amazon websites into disarray. AWS later said it was looking into “increased EC2 launch failures and networking connectivity issues.”
AWS glitches are not a new phenomenon. Clients either stretch their architecture across multiple geographic regions or use multiple providers to work around cloud outages. However, the cost and effort involved in such measures can set the clients back. The best way to defend against sudden failures is to build resilient infrastructure.
Of late, Amazon has been committing a lot of resources to solving this costly problem. To deal with the pressure of sudden traffic on a global scale during live events and popular streaming premieres, the Amazon Prime Video team employed a mix of machine learning and continuous resilience. Apart from this, the team also ensured that failovers functioned seamlessly.
Machine learning to forecast workload
Workload forecasting with ML involves taking into account a few variables, including timelines for feature rollouts, long-term planning, marketing strategies, seasonalities, and customer metrics, to project the workload trajectory.
According to Ali Jalali, an applied scientist at Amazon, combining classical time series models with deep learning helped the team zero in on an optimal risk level for the forecast. The team built predictive models to correctly forecast the spike in traffic for the popular Indian series, Mirzapur.
The platform has now created a similarity engine that uses past Amazon events, feedback from social media and IMDb ratings to predict hype. The resiliency of a system can then be tested against this expected hype. This has helped the Prime Video team determine which data centres can handle different workloads based on their latency and availability.
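The article does not describe how the similarity engine works internally. As a rough illustration of the general idea only, the sketch below compares a new title with past events using cosine similarity and estimates its peak load as a similarity-weighted average of the closest matches. The function names, the feature layout (social media buzz, IMDb rating, prior viewership) and all numbers are hypothetical, not Prime Video's actual model or data.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_peak(
    new_features: Sequence[float],
    past_events: List[Tuple[Sequence[float], float]],
    k: int = 2,
) -> float:
    """Estimate peak load from the k most similar past events.

    past_events holds (feature_vector, observed_peak_load) pairs.
    The estimate is the similarity-weighted mean of the k nearest peaks.
    """
    scored = sorted(
        ((cosine_similarity(new_features, feats), peak) for feats, peak in past_events),
        reverse=True,
    )[:k]
    total_weight = sum(sim for sim, _ in scored)
    if total_weight <= 0.0:
        raise ValueError("no past event is similar enough to the new one")
    return sum(sim * peak for sim, peak in scored) / total_weight


# Hypothetical features: [social media buzz, IMDb rating / 10, prior viewership], all scaled to 0..1.
past = [
    ([0.9, 0.85, 0.8], 120_000.0),  # illustrative peak requests per second
    ([0.3, 0.60, 0.2], 25_000.0),
    ([0.7, 0.80, 0.6], 90_000.0),
]
estimate = predict_peak([0.8, 0.82, 0.7], past, k=2)
```

An estimate like this could then serve as the target load for a resilience test, which is the role the expected hype plays in the text above.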
Machine learning for incident management
Automating the response to an incident can massively reduce the time taken to mitigate the issue while also easing the process for engineers. The model will locate the likely error and troubleshoot it while searching for patterns of false alarms.
An automated system well-fed with data will reduce such incidents over time. The goal isn’t to remove humans completely from incident management but to make their jobs easier.
The DevOps team at Prime Video built ML models to determine workload demand and used chaos engineering to practise IT incident recovery. Chaos engineering entails injecting a little latency into a function in a controlled environment. Gradually, more functions are injected with latency until the system reaches its breaking point. The intention is to observe how the system reacts during an unforeseen event and, in turn, to build greater confidence that it will perform normally in the face of such events.
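The gradual latency injection described above can be sketched as a small simulation. This is a minimal, deterministic model rather than a real chaos engineering tool: the `Service` class, the function names, the latencies and the timeout are all hypothetical, and latency is added up arithmetically instead of being injected into live calls.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Service:
    """Hypothetical request path: a chain of functions with base latencies in ms."""
    base_latency_ms: Dict[str, float]
    timeout_ms: float  # a request fails once total latency exceeds this budget

    def total_latency(self, injected_ms: Dict[str, float]) -> float:
        return sum(
            base + injected_ms.get(name, 0.0)
            for name, base in self.base_latency_ms.items()
        )

    def is_healthy(self, injected_ms: Dict[str, float]) -> bool:
        return self.total_latency(injected_ms) <= self.timeout_ms


def find_breaking_point(
    service: Service, step_ms: float = 50.0, max_rounds: int = 100
) -> Optional[Dict[str, float]]:
    """Add step_ms of latency to one function per round, cycling through them.

    Returns the first injection profile that breaks the service,
    or None if it survives max_rounds of injection.
    """
    injected: Dict[str, float] = {}
    names = list(service.base_latency_ms)
    for round_no in range(max_rounds):
        target = names[round_no % len(names)]
        injected[target] = injected.get(target, 0.0) + step_ms
        if not service.is_healthy(injected):
            return dict(injected)
    return None


# Illustrative numbers only: 200 ms of base latency against a 400 ms timeout.
service = Service(
    base_latency_ms={"auth": 20.0, "catalog": 60.0, "playback": 120.0},
    timeout_ms=400.0,
)
breaking = find_breaking_point(service)
```

In this example the service tolerates 200 ms of injected latency and breaks on the fifth round, when the total reaches 450 ms. In a real experiment the same loop would wrap actual calls in a controlled environment, with a way to stop and roll back as soon as the breaking point is observed.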
To start with, a backup data centre must be checked for its failover reliability. More outages have resulted from IT and network problems than power issues in the recent past.
There are three major components to reactive architecture:
- Elasticity: Scaling applications as and when required.
- Responsiveness: Keeping the system aware of its surroundings at all times, which makes it more alert and cautious.
- Resilience: This ensures the system is robust and up-and-running come what may.
Later, the Prime Video team came up with a resilience score. The score indicated the team’s preparedness to deal with failures, avoid downtime, accept failures as a norm, and design contingency plans. However, the resilience score is not a marker for the system’s performance but a report that helps the team understand how to prioritise.