At noon yesterday, chaos broke out: a global, widespread outage took down millions of web pages, services and online platforms, triggering panic. It felt like what we experienced a few years ago with WannaCry, the malware that spread across the entire world and brought down institutions, hospitals, banks and more. But what happened yesterday was not classified as a cyberattack, but as a failure: a failure in a network of servers.
Fastly's error
Fastly is a CDN (Content Delivery Network) provider, an American cloud computing company. A CDN, or content delivery network, is a group of geographically distributed servers that work together to deliver Internet content quickly.
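The core idea of a CDN can be sketched in a few lines: route each user to the replica closest to them. The edge locations and coordinates below are purely illustrative assumptions, not Fastly's real points of presence.

```python
import math

# Hypothetical edge locations (city -> latitude, longitude).
# Illustrative only; not Fastly's actual POP list.
EDGE_SERVERS = {
    "madrid": (40.42, -3.70),
    "london": (51.51, -0.13),
    "new_york": (40.71, -74.01),
    "tokyo": (35.68, 139.69),
}

def nearest_edge(user_lat, user_lon):
    """Return the name of the edge server closest to the user."""
    def haversine(lat1, lon1, lat2, lon2):
        # Great-circle distance in km between two points on Earth.
        r = 6371.0
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlmb = math.radians(lon2 - lon1)
        a = (math.sin(dphi / 2) ** 2
             + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
        return 2 * r * math.asin(math.sqrt(a))

    return min(EDGE_SERVERS,
               key=lambda name: haversine(user_lat, user_lon, *EDGE_SERVERS[name]))

# A user in Barcelona (41.39, 2.17) is routed to the Madrid edge.
print(nearest_edge(41.39, 2.17))  # prints: madrid
```

The upside of this design is speed; the downside, as yesterday showed, is that a bug pushed to those shared edge servers takes down every customer at once.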
Its role is precisely to prevent outages like today's: Fastly replicates the web pages or services of the clients who hire it across servers in different parts of the world. But as we saw, something went wrong, and the result was a historic, roughly 60-minute outage across the world's Internet media. We are talking about millions of downed web pages.
Fastly’s official explanation
But why did this happen? What did Fastly do to cause such a mess on a global scale? Well, the company has published the official causes, and they boil down to what you would expect: someone made a change, and it inadvertently brought everything down.
According to the official explanation, the story goes back to May 12, when Fastly started "a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances".
Fast-forward to yesterday, June 8: in the early hours of the day, "a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors". The chronology of events is as follows, with all times in UTC (for Spanish time, add 2 hours, so 09:47 UTC is 11:47 am in Spain):
- 09:47 am: Start of global outage
- 09:48 am: Global outage identified by Fastly monitoring
- 09:58 am: A status message is posted
- 10:27 am: Fastly engineering department identifies customer configuration
- 10:36 am: Affected services begin to recover
- 11:00 am: Most services have recovered
- 12:35 pm: Incident mitigated
- 12:44 pm: Publication of the resolved status
- 5:25 pm: Bug fix rollout begins
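The trigger pattern Fastly describes, a *valid* customer configuration tripping a dormant bug shipped weeks earlier, is a classic failure mode. Fastly has not published the faulty code, so the sketch below is a generic, hypothetical illustration of the pattern: the deployed code silently assumes a shape of input that the configuration schema nonetheless allows.

```python
# Hypothetical illustration only; Fastly has not disclosed the actual bug.

def deploy_config(config):
    """Toy model of the buggy handler shipped in the May 12 deployment."""
    # Latent bug: the code assumes every rule list is non-empty,
    # so a perfectly valid empty list crashes the handler.
    rules = config["rules"]
    return rules[0]["action"]  # IndexError when rules == []

def deploy_config_fixed(config):
    """Toy model of a permanent fix: handle the valid-but-unanticipated case."""
    rules = config.get("rules", [])
    if not rules:
        return "default"  # fall back instead of erroring
    return rules[0]["action"]

valid_but_fatal = {"rules": []}  # valid per the schema, fatal to the old code
try:
    deploy_config(valid_but_fatal)
except IndexError:
    print("edge node returns errors")  # the state 85% of the network was in

print(deploy_config_fixed(valid_but_fatal))  # prints: default
```

The point of the sketch is that nothing the customer did was invalid; the deployment in May simply never exercised that code path until one configuration finally did.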
1 minute to detect it, 1 hour to fix it
Once the immediate effects were mitigated, "we turned our attention to fixing the bug and communicating with our customers. We created a permanent fix for the bug and began deploying it at 5:25 pm". The curious thing, according to the chronology, is that Fastly's team detected the error in as little as 60 seconds, yet most of the millions of affected websites were either malfunctioning or outright down for about 60 minutes.
Fastly will also conduct a full investigation into its practices during the incident, as well as determine why its review processes failed to detect the bug that caused the global outage. It will also evaluate ways to improve its remediation time.