What Reliability Engineers can learn from Google’s December 2020 OAuth Outage
Five Takeaways to Apply to Your Reliability Strategy
This is my detailed analysis of the reliability failure, based on the details published in the incident report.
Note: the opinions expressed in this blog are solely my own and do not represent the views or opinions of any organization I have previously been, am currently, or will in the future be associated with.
If you were using one of many Google™️ tools on Monday, December 14th, 2020, you may well have suddenly found yourself unable to continue as usual. Gmail™️ email service, YouTube™️ video community, and many parts of Google Workspace™️ productivity and collaboration tools all had problems with user login; in some cases, you had problems even if you were already logged in. It wasn’t just native Google apps: if you were using a service that relied on the GCP™️ infrastructure platform, like Slack®, you might have found yourself affected.
The outage was reported globally and made headlines in many newspapers. Using the incident details that Google published, we can piece together what happened. Some details have to be inferred, but the main points are as reported in the incident report. How did such a major outage occur at Google, the pioneer of Site Reliability Engineering? For Reliability Engineers, an outage is a learning opportunity, whether it occurs in your organization or elsewhere. And it’s a lot cheaper to learn from other people’s mistakes.
Our protagonists: Quota and User ID
The story of this outage starts in October, two months before the actual outage event in mid-December. Google uses a Quota system to decide service resource allocation. It apportions storage and other resource types, like processing power or memory. Back in October, Google upgraded their Quota system and rolled it out, but parts of the previous Quota system were left in place. It’s not stated whether this was an error or a normal part of upgrading the Quota system, but it would be normal Reliability Engineering practice to have both old and new systems running and gradually move services from the old to the new, to reduce the risk from issues in the new system.
Google’s User ID service handles authorization from customer-facing services. All services that require sign-in via a Google Account, including Gmail, Workspace, googleapis.com, and many GCP services, including Cloud Console, BigQuery™️ enterprise data warehouse, Google Cloud Storage™️ service, GKE™️ software service, use the User ID service. User ID includes a distributed database which stores the leases it provides, together with the lease expiry. Any Google service sending a request to authorize a user obtains a lease from User ID (assuming the user is authorized) which is valid for a short period. The distributed database will be updated with the authorization lease and expiry and, when a lease expires, authorization will be denied until a new lease is requested and granted. To grant the lease, User ID needs to successfully store it to the distributed database first, before handing it out to the requesting service.
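To make the lease flow concrete, here is a minimal sketch in Python of how such a lease-based authorization path might look. The names (LeaseStore, grant_lease, the 300-second TTL) are my own illustrative choices, not Google’s actual implementation.

```python
# A minimal sketch of a lease-based authorization flow, assuming a simple
# in-memory store standing in for the distributed database described above.
# All names (LeaseStore, grant_lease, the 300-second TTL) are illustrative
# choices of mine, not Google's actual implementation.
import time
import uuid

LEASE_TTL_SECONDS = 300  # a short-lived lease; the real value is not published


class LeaseStore:
    """Stand-in for the distributed database holding leases and their expiries."""

    def __init__(self):
        self._leases = {}
        self.writable = True  # flips to False when storage quota is exhausted

    def write(self, lease_id, record):
        if not self.writable:
            raise IOError("storage quota exhausted: write rejected")
        self._leases[lease_id] = record

    def read(self, lease_id):
        return self._leases.get(lease_id)


def grant_lease(store, user_id):
    """Grant a lease only if it can be durably stored first."""
    lease_id = str(uuid.uuid4())
    record = {"user": user_id, "expires_at": time.time() + LEASE_TTL_SECONDS}
    store.write(lease_id, record)  # must succeed BEFORE the lease is handed out
    return lease_id


def is_authorized(store, lease_id):
    """A lease is only valid while its stored expiry is in the future."""
    record = store.read(lease_id)
    return record is not None and record["expires_at"] > time.time()
```

The property that matters for this story is that the write has to succeed before the lease is handed out: if the store stops accepting writes, every new grant fails, and authorization stops for everyone as soon as their existing leases expire.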
The Unrecognized Failure in October
Sometime in October, after the new version of the Quota system was available, the User ID service was registered with that new version. The correct registration procedure should have deregistered the User ID service from the parts of the old Quota system still present. However, that didn’t happen correctly, and the User ID service was not fully deregistered.
Reliability Engineering Concern 1: Lack of notification for incorrect configuration. There was a monitoring failure, a rollout failure, or both, in a system change that left the system in an incorrect configuration with no notification that this was the case. If having both old and new versions of a system running was expected, there should have been some kind of reconciliation procedure to identify and notify about services registered with both the old and the new version. On the other hand, if having both old and new versions running was an error, alerts should have fired. Google noted this in their incident review, committing to “improve monitoring and alerting to catch incorrect configurations sooner”.
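As a rough illustration of the kind of reconciliation that could have caught this, here is a sketch that flags any service registered with both the old and new Quota systems. The registry contents and service names are hypothetical.

```python
# An illustrative reconciliation pass over the two registries. The registry
# contents and service names here are hypothetical.
def reconcile(old_registry, new_registry):
    """Flag services whose registrations disagree during the quota migration."""
    findings = []
    for service in sorted(old_registry & new_registry):
        findings.append((service, "registered with BOTH old and new Quota systems"))
    return findings


old_quota_registry = {"user-id-service", "legacy-batch"}
new_quota_registry = {"user-id-service", "gmail-frontend"}

for service, problem in reconcile(old_quota_registry, new_quota_registry):
    # In a real system this would page someone or file a ticket, not print.
    print(f"ATTENTION: {service}: {problem}")
```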
Critically, although the User ID system was still registered with parts of the old Quota system, it was no longer reporting its current resource usage to any part of the old Quota system. This meant the old Quota system considered the User ID service to require zero resources. However, the old Quota system could also see that, despite apparently wanting zero resources, the User ID service still had resources allocated to it. The old Quota system naturally saw this as an opportunity to improve resource allocation: as far as it was concerned, the User ID service was holding resources it wasn’t using and thus no longer needed. Consequently, the old Quota system scheduled a reduction in the resources allocated to the User ID service.
Reducing resources is something that could lead to serious issues if it’s done in error. Google recognized that, and had implemented a good Reliability Engineering pattern: enforcing a grace period, which seems to have been several weeks in this case, before making a potentially harmful automated change, presumably accompanied by at least one notification to the service owners. Proceeding after a grace period is a reasonable practice: there would be no reason to wait for manual confirmation, as, for example, a service abandoned as no longer useful might have no owner left to respond to a notification. It’s unfortunate, but common, that notifications are ignored.
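Here is a sketch of the sort of automated reclamation logic described above, with the flaw intact: a reported usage of zero plus a large allocation looks like waste, so a drastic reduction gets scheduled behind a grace period. Every value and name is a guess for illustration only.

```python
# A sketch of automated quota reclamation with the flaw intact. Every value
# and name here is a guess for illustration only.
import datetime

GRACE_PERIOD = datetime.timedelta(weeks=6)  # the report implies several weeks


def plan_quota_reduction(service, reported_usage, allocated_quota, now):
    """Schedule a reduction when a service appears to hold more than it uses.

    The flaw: usage that is merely unreported (and so shows up as zero) is
    indistinguishable from usage that is genuinely zero.
    """
    if reported_usage < allocated_quota:
        return {
            "service": service,
            "new_quota": reported_usage,     # shrink to the apparent usage
            "apply_at": now + GRACE_PERIOD,  # owners notified, then it proceeds
        }
    return None


plan = plan_quota_reduction(
    service="user-id-service",
    reported_usage=0,        # not really zero: usage was no longer reported to the old system
    allocated_quota=10_000,  # arbitrary units
    now=datetime.datetime(2020, 10, 30),
)
print(plan)  # a drastic reduction, quietly scheduled for mid-December
```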
There should have been an urgent alert that a very large quota reduction was planned. In fact, following another good Reliability Engineering practice, Google did have the perfect check in place for exactly this kind of issue: “Excessive quota reduction to storage systems.” Sadly, it was implemented to check only after a change was actually applied, not when a change was planned, so it didn’t trigger until weeks later, when the change was applied and the outage had already started.
There were several other existing alerting checks that could have been applicable but also failed to trigger, for reasons sketched in code after this list:
- A check for “lowering quota below usage”. Another perfect alert that would have prevented the subsequent issue. It didn’t trigger because, as explained earlier, the User ID service’s resource usage was incorrectly being reported as zero to the old Quota system, and the quota was not being lowered below zero!
- A check for “Low quota”. This didn’t trigger, Google says, because “the difference between usage and quota exceeded the protection limit”. I don’t fully understand this, but it sounds like if there is a really large difference between quota and usage (here the quota would be all the resources the User ID service was actually using, and the reported usage would be zero), the alert is suppressed, presumably because such a large gap looks more like an error in the alerting data than in the real system.
- A check for “Quota changes to large number of users”. This sounds like it should have applied, because this quota change affected every Google customer in the world who needed authorization! But it seems the check was actually applied against the owners of the affected services, in this case just the User ID service owners, so just one group of users. Hence it did not trigger.
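Here is a sketch of how all three checks could stay silent given a reported usage of zero. The thresholds, field names, and exact alert logic are my guesses, based only on the descriptions above.

```python
# A sketch of why each of the three checks could stay silent when usage is
# reported as zero. Thresholds, names, and the exact alert logic are my
# guesses, based only on the descriptions above.
LOW_QUOTA_MARGIN = 100    # hypothetical margin for the "low quota" check
PROTECTION_LIMIT = 1_000  # hypothetical suppression threshold


def quota_below_usage_fires(new_quota, reported_usage):
    # "Lowering quota below usage": usage was (wrongly) reported as zero, and
    # the new quota was not below zero, so this never fires.
    return new_quota < reported_usage


def low_quota_fires(quota, reported_usage):
    # "Low quota": reading the report as above, an implausibly large gap
    # between quota and usage suppresses the alert entirely.
    if abs(quota - reported_usage) > PROTECTION_LIMIT:
        return False
    return quota - reported_usage < LOW_QUOTA_MARGIN


def mass_change_fires(affected_owner_groups, threshold=50):
    # "Quota changes to large number of users": evaluated against the owners
    # of the affected services (one group), not the end users who felt it.
    return len(affected_owner_groups) > threshold


reported_usage = 0        # the bad input at the root of everything
current_quota = 10_000    # arbitrary units
planned_quota = reported_usage

print(quota_below_usage_fires(planned_quota, reported_usage))  # False: 0 < 0
print(low_quota_fires(current_quota, reported_usage))          # False: suppressed
print(mass_change_fires(["user-id-service-owners"]))           # False: one group
```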
Finally, one quiet morning in December: Boom
Several weeks later, on December 14th, the grace period expired and the User ID resource quota reduction started. The particular reduction that immediately impacted the User ID service’s functionality was likely a storage reduction, though this is not precisely defined in the post-mortem.
The immediate impact was that the User ID service’s distributed database was unable to write new records. As mentioned earlier, the User ID service has short leases. As soon as a lease expires, Google services will request a new lease. With database writes disabled, no new leases could be successfully created — and all authorization requests failed.
Reliability Engineering Concern 2: Insufficient resilience to database write failure. It’s clear that having all authorization fail because of an inability to store the lease expiry is distinctly suboptimal. Google noted this in their incident review and promised to “evaluate and implement improved write failure resilience into our User ID service database”.
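One illustrative way to degrade more gracefully, building on the lease sketch earlier: keep honoring leases that are already stored and unexpired, and tolerate very recently expired leases for a bounded window while the write path is down. This is a sketch of a possible mitigation, not necessarily the resilience improvement Google implemented.

```python
# A sketch of one possible mitigation, reusing LeaseStore and grant_lease from
# the earlier lease sketch. The fail-soft window is a hypothetical choice of
# mine; this is not necessarily the resilience improvement Google implemented.
import time

MAX_FAIL_SOFT_SECONDS = 600  # how long to tolerate write failures (illustrative)


def authorize(store, lease_id, user_id, now=None):
    """Return (authorized, lease_id), degrading gracefully on write failure."""
    now = now or time.time()
    record = store.read(lease_id) if lease_id else None

    # An unexpired lease already on record needs no new write at all.
    if record and record["expires_at"] > now:
        return True, lease_id

    try:
        # Normal path: mint a fresh lease, which requires a durable write.
        return True, grant_lease(store, user_id)
    except IOError:
        # Write path is down: rather than failing everyone at once, keep
        # honoring leases that expired only very recently, for a bounded window.
        if record and record["expires_at"] > now - MAX_FAIL_SOFT_SECONDS:
            return True, lease_id
        return False, None
```

The trade-off is deliberate: a short, bounded fail-soft window weakens revocation guarantees slightly, in exchange for not turning a storage write problem into a global login outage.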
Then, all authorization requests failed, for every Google service, globally.
Reliability Engineering Concern 3: Simultaneous global change procedure. One of the highest-priority Reliability Engineering directives is to minimize blast radius and keep errors local. The epitome of this directive is that an error in one region should never affect other regions. To support this, changes should be made region by region rather than to all regions at the same time. The original changes to the Quota system, and the registration of the User ID service with the new Quota system, would almost certainly have happened region by region, rolling out to further regions only after it was clear there were no issues. Because the original bug was never detected during these regional changes, the change was eventually rolled out to all regions, leaving every region exposed at once. However, this simultaneous multi-region failure wasn’t only because the bug was present in all regions; the critical failure was applying the quota restriction to all regions simultaneously. As a follow-up item, Google intends to “review (their) quota management automation to prevent fast implementation of global changes”.
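Here is a sketch of what region-by-region application of a risky change can look like: apply to one region, wait for a bake period, check health, and halt if anything looks wrong. The region names, bake time, and apply_change/healthy callbacks are all hypothetical.

```python
# A sketch of region-by-region application of a risky change. Region names,
# bake time, and the apply_change/healthy callbacks are all hypothetical.
import time

ROLLOUT_ORDER = ["canary-region", "small-region", "us-east", "us-central"]
BAKE_TIME_SECONDS = 3600  # watch alerts and error rates between regions


def roll_out_quota_change(apply_change, healthy, regions=ROLLOUT_ORDER,
                          bake_time=BAKE_TIME_SECONDS):
    """Apply a change one region at a time, halting at the first sign of trouble."""
    for region in regions:
        apply_change(region)
        time.sleep(bake_time)  # let error rates and alerts surface
        if not healthy(region):
            # Stop here: the blast radius is one region, not the whole planet.
            raise RuntimeError(f"rollout halted: {region} unhealthy after change")
```

Run through a pipeline like this, the quota enforcement change would, in principle, have broken only the first region before a health check halted the rollout.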
Denouement
The actual outage was handled quite efficiently. The quota change was applied and capacity alerts fired. Three minutes later, errors from the User ID service triggered yet more alerts; after another minute, SRE was paged. Within half an hour the cause was identified, and within the hour Google tried a workaround: disabling quota enforcement in one region. The workaround was immediately successful, so five minutes later Google applied it globally.
Any serious outage tends to highlight opportunities for better Reliability Engineering. There are the immediate causes of the outage: typically, the more serious the outage, the more things that went wrong. But there are also indirect effects: services that are more strongly dependent on other services than they should be, and operations that should be automatically recoverable but aren’t.
Reliability Engineering Concern 4: Services with hard dependencies that should be soft dependencies. Many Google Cloud services were impacted more extensively than they should have been because of unexpected hard dependencies.
Reliability Engineering Concern 5: Recoverable operations failed to recover. Uploads to Google Cloud Storage that were supposed to be resumable permanently lost their resumability if they were attempted during the outage.
If a major outage can happen at Google, it can happen anywhere. The best you can do is learn from outages, whether yours or someone else’s, so that you minimize their impact. Take these findings and make your systems more reliable; benefit from Google’s lessons.