Top social media apps — Facebook, Instagram, and WhatsApp — suffered one of the longest global outages last night. For about six hours, all three apps, which together command over 3.5 billion monthly users, were down. The outage shaved about $6 billion off Facebook founder Mark Zuckerberg's fortune and pushed him down a few positions on the list of the world's richest people.
In an official statement, Facebook attributed the unprecedented outage to a configuration change in the backbone routers that coordinate network traffic between its data centres. The change had a cascading effect and brought all of Facebook's services to a halt. In layman's terms, everything Facebook runs disappeared for a period of time.
What Went Wrong
The cause traces back to the Border Gateway Protocol (BGP), explained in a detailed blog post by web security firm Cloudflare. BGP is a mechanism for exchanging routing information among autonomous systems. Internet routers constantly update their lists of possible routes; if BGP stops working, routers no longer know where to send traffic, bringing the Internet to a halt. So while DNS (Domain Name System) is the address book that maps each website's name to its IP address, BGP is the roadmap that finds the most efficient route to that address.
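The division of labour between DNS and BGP can be illustrated with a toy model. This is a minimal sketch, not real networking code: the domain name, IP addresses, and routing-table entries are all made-up placeholders (using documentation address ranges), and the prefix match is deliberately crude.

```python
# Toy model: DNS maps names to addresses (the "address book"),
# while BGP-learned routes say how to get there (the "roadmap").
# All names and table entries below are hypothetical.

DNS_RECORDS = {"example-social.com": "198.51.100.7"}  # name -> IP address
ROUTES = {"198.51.0.0/16": "peer-AS-64500"}           # prefix -> next hop

def prefix_for(ip: str) -> str:
    # Crude /16 lookup, just for the demo
    return ".".join(ip.split(".")[:2]) + ".0.0/16"

def reachable(name: str) -> bool:
    ip = DNS_RECORDS.get(name)
    if ip is None:
        return False                  # no DNS answer at all
    return prefix_for(ip) in ROUTES   # is there a route to that address?

print(reachable("example-social.com"))  # True while the route is announced

# If the route is withdrawn, the address is still known but unreachable --
# and if the DNS servers themselves sit behind the withdrawn prefixes,
# even the name lookup fails, which is what happened in the outage.
del ROUTES["198.51.0.0/16"]
print(reachable("example-social.com"))  # False
```

The point of the sketch is that knowing an address (DNS) and being able to reach it (BGP routes) are separate things, and losing either one takes a site offline.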
Further, an individual network with a unified internal routing policy is called an Autonomous System (AS), and each AS is identified by an Autonomous System Number (ASN). An AS can originate prefixes (announce that it controls a group of IP addresses) and transit prefixes (relay announcements that tell other networks how to reach groups of IP addresses it does not itself control).
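The difference between originating and transiting a prefix can be sketched in a few lines. This is an illustrative simplification using ASNs from the documentation range; real BGP announcements carry many more attributes than a prefix and an AS path.

```python
# Hypothetical sketch: AS 64501 originates a prefix, AS 64500 transits it.
# In BGP, the AS path records which networks an announcement has crossed,
# and the last ASN in the path is the originator.

announcements = []

def announce(prefix: str, as_path: list[int]) -> None:
    # A simplified BGP-style announcement: "reach `prefix` via this AS path"
    announcements.append({"prefix": prefix, "as_path": as_path})

# AS 64501 originates 203.0.113.0/24: the path contains only itself
announce("203.0.113.0/24", [64501])

# AS 64500 transits it: it prepends its own ASN and re-announces to peers
announce("203.0.113.0/24", [64500, 64501])

for a in announcements:
    origin = a["as_path"][-1]
    print(a["prefix"], "origin AS", origin, "via path", a["as_path"])
```

In this model, Facebook's network is an AS originating the prefixes that contain its servers (including its DNS servers), which is why withdrawing those announcements made them vanish from the rest of the Internet's roadmap.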
As per Cloudflare, Facebook stopped announcing the routes to its DNS prefixes, which meant its DNS servers became unreachable. A BGP UPDATE message informs routers of changes to a prefix advertisement, or withdraws the prefix entirely. Cloudflare noted that Facebook's advertisements are normally stable, as the social media giant does not change its network minute by minute. Just before the global outage, however, Cloudflare observed a surge of routing changes from Facebook. The routes were withdrawn, the DNS servers went offline, and Facebook and its associated sites were effectively disconnected from the Internet.
Facebook could not fix the issue quickly because its own internal systems run on the same infrastructure. Staff were cut off from their internal communication tools and, with the security pass system also affected, were unable to enter their offices. As per reports, Facebook sent a technical team to its data centre in California to manually reset the servers where the problem originated.
Interestingly, while Facebook and its related apps were down, people went looking for alternatives: DNS queries for Twitter, Signal, and other social media apps spiked.