Fastly, Google and Amazon’s “Bug Already Present” Failure Pattern that Caused the Three Biggest Internet Outages in the Last Year

In all cases, a bug that wasn’t triggered until long after release caused a cascade of failures

Jack Shirazi
4 min read · Jun 30, 2021

Note: the opinions expressed in this blog are solely my own and do not express the views or opinions of any organization I have been, am, or will be associated with

[Image: pieces of a broken plate spread across the floor. Photo by CHUTTERSNAP on Unsplash]

Amazon and Google are two of the world’s leading proponents of Reliability Engineering. Google pioneered Site Reliability Engineering (SRE) and introduced it to the tech world, and has advocated for it continually, producing some of the most influential books and training materials on the subject. Amazon has strongly adopted Reliability Engineering as a discipline: many of the Amazon sessions at its annual re:Invent conferences are devoted to concepts and activities around Reliability Engineering, and Amazon has perhaps the only major product that provides a 100% reliability SLA guarantee (I’ve written previously about the associated cost of doing this, which is why it’s rarely done). And Fastly, as a major global CDN, needs to be available at all times.

So when any of these companies has a significant outage, it’s of major interest to the Reliability Engineering community. And over the last year, each of them had a serious outage that significantly impacted their customers and much of the internet (Amazon’s Kinesis event, Google’s OAuth outage, and Fastly’s configuration outage). What’s even more interesting is that all of these outages had the same underlying reliability failure pattern — a bug that wasn’t triggered on release, but was instead triggered long after release.

It’s almost impossible to have bug-free services in your production systems. Reliability Engineering recognizes this, assumes that failures will occur, and looks for ways to mitigate them so that services continue to be available in some form when they do. But a standard first question during an incident is to ask what has changed recently, because more often than not a recent change caused the incident. Reliability Engineering tends to assume that if a bug is released into production, it’ll be triggered soon after rollout, so the engineering emphasis is on detecting that early and rolling back quickly to a known safe configuration.

This means a bug triggered long after rollout (where you have no way to correlate the incident with the change) is a real nightmare scenario for Reliability Engineers. It partly explains why Fastly, Google and Amazon can all have major outages despite their Reliability Engineering focus, though of course major outages tend to have more than one thing go wrong, as they did in all these cases:

  • In Amazon’s case, the front-end Kinesis admin and routing service had a long-standing implementation bug that limited the number of service instances: each service instance needed a separate OS thread for an admin connection to every other service instance, so increasing the number of instances for capacity reasons hit the OS limit on thread counts; the consequent failures within the services then corrupted the instances and caused a cascade of failures (I do a detailed analysis here; a numeric sketch of the scaling limit follows this list)
  • In Google’s case, a new bug (incorrectly reporting resource usage as zero in some situations) was deployed in the Quota system with a new version. The bug was probably triggered relatively soon after deployment, when a new version of the User ID Service started using it, but the triggering had no immediate effect because — ironically enough — a reliability measure delayed it taking effect for weeks. Normally that’s ideal, giving operations plenty of time to correct the failure with no impact. But in this case the bug came in two parts: because no one had thought that zero usage could be reported, there was no check on it (a sketch of that kind of missing check also follows this list). That isn’t surprising: good Reliability Engineering practice would have a check implemented alongside a released feature, but you don’t implement limit checks alongside a bug, because you don’t realize you’ve released it! When the reliability delay expired, the bug caused resource failures in the User ID Service and hence the resulting outage (I do a detailed analysis here)
  • At the time of writing, Fastly hadn’t published a detailed analysis of their outage, only that a bug released about a month previously had been triggered by a customer.
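
The Kinesis failure mode is easier to see with numbers. Below is a minimal, hypothetical Java sketch (not Amazon’s code; the OS thread limit and baseline thread count are assumptions purely for illustration) of a thread-per-peer design, where each front-end instance holds one admin connection, and so one OS thread, per other instance. The bug is present at every fleet size, but it only bites once a routine capacity increase pushes the per-instance thread count over the OS limit.

```java
// Hypothetical model of a thread-per-peer front-end fleet.
// Not Amazon's code: the limit and baseline values are illustrative assumptions.
public class ThreadPerPeerSketch {

    static final int OS_THREAD_LIMIT = 4096;  // assumed per-process thread cap
    static final int BASELINE_THREADS = 600;  // assumed threads needed for normal serving

    /** Threads one instance needs: its baseline plus one admin-connection thread per peer. */
    static int threadsNeeded(int fleetSize) {
        return BASELINE_THREADS + (fleetSize - 1);
    }

    public static void main(String[] args) {
        // The design is unchanged at every size; only the fleet size changes.
        for (int fleetSize : new int[] {500, 1000, 2000, 3000, 3500, 3600}) {
            int needed = threadsNeeded(fleetSize);
            System.out.printf("fleet=%d  threads per instance=%d  %s%n",
                    fleetSize, needed,
                    needed > OS_THREAD_LIMIT ? "EXCEEDS OS THREAD LIMIT" : "ok");
        }
    }
}
```

Under these assumed numbers the limit is crossed somewhere around a 3,500-instance fleet; the point is not the exact figure but that a perfectly routine scaling action is what finally triggers the long-dormant bug.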

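And here is a sketch of the kind of check that was missing in the Quota system. This is not Google’s code; the class, method and validation rule are assumptions, purely to illustrate the reasoning in the bullet above: a reported usage of zero from a service that was recently busy is far more likely to be a reporting bug than a real collapse in demand, so it should raise an alert rather than silently shrink the quota.

```java
import java.util.OptionalLong;

// Hypothetical sanity check on reported quota usage. Not Google's code:
// the names and the rule are illustrative assumptions.
public class QuotaUsageSanityCheck {

    /**
     * Returns the usage value that quota enforcement should act on, or empty
     * if the reported value looks implausible and should be escalated instead.
     */
    static OptionalLong validate(long reportedUsage, long previousUsage) {
        // Zero usage from a service that was recently handling real traffic is
        // treated as a suspect reading: keep the old value and page a human.
        if (reportedUsage == 0 && previousUsage > 0) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(reportedUsage);
    }

    public static void main(String[] args) {
        long previousUsage = 1_000_000L;  // assumed recent usage
        long reportedUsage = 0L;          // the buggy report

        OptionalLong checked = validate(reportedUsage, previousUsage);
        if (checked.isEmpty()) {
            System.out.println("Implausible zero usage reported; keeping quota based on "
                    + previousUsage + " and alerting operations instead of enforcing.");
        }
    }
}
```

With a check like this in place, the weeks-long enforcement delay would have done its job: operations would have been alerted while there was still plenty of time to fix the misreporting.
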
Major outages like these are good learning events for Reliability Engineers, and they are especially interesting when they happen at major reliability practitioners like Google and Amazon. These events tell us that even the best reliability practitioners can’t entirely avoid major incidents. So the systems that quickly identify and notify of issues, trace which components are causing an issue and which are impacted, help you understand its causes, and mitigate it before a full fix can be applied all need to be in place and well practiced.


Jack Shirazi

Working in the Elastic Java APM agent team. Founder of JavaPerformanceTuning.com; Java Performance Tuning (O’Reilly) author; Java Champion since 2005