Fastly's global outage: Here's what went wrong

Content delivery network (CDN) Fastly has explained the cause of yesterday's major outage, which knocked out many of the world's top websites, from Amazon to ZDNet.

The breadth of the outage demonstrated once again how CDNs, which bring content to end users from globally distributed points of presence (POPs), can also be a single point of failure. 

Fastly operates POPs across the globe, running on solid state drives (SSDs), that make up its "edge cloud" for delivering web content from data centers closer to end users. Instead of reaching a website's servers directly, users are served a cached copy of the site from storage maintained by the CDN.
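
To make the caching idea concrete, here is a minimal sketch of an edge cache sitting in front of an origin server. It is an illustration only, not Fastly's implementation: the `EdgeCache` class, its `get` method, and the `fetch_from_origin` callback are hypothetical names, and real CDNs add expiry, invalidation, and much more.

```python
from typing import Callable, Dict


class EdgeCache:
    """Toy edge cache: serve stored copies, fall back to the origin on a miss.

    Hypothetical illustration; there is no expiry or invalidation here.
    """

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin      # assumed callback that contacts the origin
        self._store: Dict[str, bytes] = {}   # cached responses keyed by URL path
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        if url in self._store:
            # Cache hit: the origin server is never contacted.
            self.hits += 1
            return self._store[url]
        # Cache miss: fetch from the origin once, then keep a copy.
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body
        return body
```

In this sketch, only the first request for a given URL reaches the origin; repeat requests are answered from the cache. That also shows why a failure at the cache layer cuts users off from the site even when the origin itself is healthy.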

Its global outage yesterday briefly prevented web users from accessing The Guardian, the Financial Times, The New York Times, ZDNet, Reddit, Twitch, Amazon, PayPal, and the UK government website gov.uk.

Nick Rockwell, Fastly's senior vice president of engineering, said the hour-long outage happened because a customer pushed a configuration change that triggered a previously undiscovered software bug.

Rockwell did not explain exactly what happened, other than to say that on May 12 the company deployed a software update that "introduced a bug that could be triggered by a specific customer configuration under specific circumstances."

Then yesterday, June 8, a customer pushed a configuration change that met the conditions to trigger the bug, which caused 85% of Fastly's network to return errors. End users visiting affected sites saw the "Error 503 Service Unavailable" message in their browsers.

Fastly said yesterday that the issue was causing customers to see an "increased origin load and lower Cache Hit Ratio (CHR)". CHR is the share of the requests a cache receives that it can serve itself, without going back to the origin server.
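
As a rough illustration of why a lower CHR means more origin load, the sketch below computes the ratio as hits divided by total requests. The function names and the traffic figures are made up for the example and are not taken from Fastly's report.

```python
def cache_hit_ratio(hits: int, total_requests: int) -> float:
    """Return the fraction of requests served from cache (0.0 if no traffic)."""
    if hits < 0 or total_requests < 0 or hits > total_requests:
        raise ValueError("need 0 <= hits <= total_requests")
    if total_requests == 0:
        return 0.0
    return hits / total_requests


def origin_requests(hits: int, total_requests: int) -> int:
    """Requests the cache could not serve, which fall through to the origin."""
    return total_requests - hits


# Illustrative numbers only: the same 1,000 requests at two hit ratios.
healthy = origin_requests(hits=950, total_requests=1000)   # CHR 0.95
degraded = origin_requests(hits=600, total_requests=1000)  # CHR 0.60
print(cache_hit_ratio(950, 1000), healthy)
print(cache_hit_ratio(600, 1000), degraded)
```

With these assumed figures, a drop in CHR from 0.95 to 0.60 raises origin traffic from 50 to 400 requests, an eightfold increase for the same number of visitors.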

"Once the immediate effects were mitigated, we turned our attention to fixing the bug and communicating with our customers. We created a permanent fix for the bug and began deploying it at 17:25," said Rockwell. 

The disruption began at 9:47 UTC. 

Fastly is the seventh largest CDN provider, behind rivals including Google, Cloudflare, F5, Amazon CloudFront, and jsDelivr, according to Datanyze.

The pitfall of CDNs is that when they go down, as Cloudflare did in 2019 because of a buggy configuration change, users can't access websites that rely on the CDN to deliver content.

Rockwell acknowledged that the company should have caught the bug before a customer inadvertently triggered it. He also apologized to customers.

"Even though there were specific conditions that triggered this outage, we should have anticipated it. We provide mission-critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority," he wrote.  

"We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support."
