What Reliability Engineers can learn from Amazon’s November 2020 Kinesis Outage
Six Takeaways to Apply to Your Reliability Strategy
This is my detailed analysis of the reliability failure based on the details in the published incident report.
Note: the opinions expressed in this blog are solely my own and do not represent the views or opinions of any organization I have been, am, or will be associated with.
If your workloads or services were using the AWS us-east-1 region on November 25th, 2020, you might have found them degraded or unavailable. Kinesis, Cognito, CloudWatch, Auto Scaling, and Lambda all provided degraded service to AWS customers. In this article I use the incident details that Amazon published to understand what happened. I have had to infer some details, and I may have gotten some of that fine detail wrong, but the main points are as reported in the incident report.
1. The Kinesis Front-End Servers
AWS Kinesis consists of back-end servers that provide the main Kinesis streaming features, and front-end administrative servers that handle authentication, throttling, and request routing. The critically important part of the Kinesis architecture for this outage is that every front-end server connects to every other front-end server, so that administrative changes can be communicated to all of them. This design is not a problem. However, the implementation of the communication mesh between all the front-end servers used one OS thread per connection.
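To make the failure mode concrete, here is a minimal sketch of a thread-per-connection mesh client, written in Python for readability. This is my own illustration, not Amazon's code, and the helper names are assumptions. The point is simply that each server needs roughly one thread per peer, so the per-host thread count grows with the size of the fleet.

```python
import socket
import threading

def apply_admin_update(message: bytes) -> None:
    """Stand-in for applying an admin change (membership, throttling, routing)."""
    pass

def handle_peer(peer_address):
    """Hold a long-lived connection to one peer and relay its admin updates."""
    with socket.create_connection(peer_address) as conn:
        while True:
            message = conn.recv(4096)
            if not message:
                break  # peer closed the connection
            apply_admin_update(message)

def connect_to_fleet(peer_addresses):
    # One thread per peer: workable for a small fleet, but the per-host thread
    # count grows linearly with fleet size, and OS threads (and their stacks)
    # are a finite resource.
    for peer in peer_addresses:
        threading.Thread(target=handle_peer, args=(peer,), daemon=True).start()
```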
Reliability Engineering Concern 1 — One OS thread per connection is a well-known scaling anti-pattern. Given Amazon's experience with huge-scale infrastructure, it would normally be flagged internally, so it's likely this part of the Kinesis front-end was implemented long before extensive scaling was a consideration for the product, and then forgotten about. That's not unusual in the tech industry, and it suggests that repeating design reviews as a service scales up is good practice.
OS threads are a limited resource, so highly scaled services more commonly use multiplexing for their connections, which limits the number of threads needed. The one-thread-per-connection design was essentially a long-standing scaling bug in the front-end servers, waiting for the wrong conditions to trigger it.
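For contrast, here is a minimal sketch of the multiplexed alternative using Python's standard selectors module; again this is my own illustration, not the Kinesis implementation. A single event-loop thread services every peer connection, so the thread count stays flat as the fleet grows.

```python
import selectors
import socket

def apply_admin_update(message: bytes) -> None:
    """Stand-in for applying an admin change, as in the previous sketch."""
    pass

selector = selectors.DefaultSelector()

def connect_to_fleet(peer_addresses):
    # Register every peer connection with one selector instead of spawning
    # one thread per peer.
    for peer in peer_addresses:
        conn = socket.create_connection(peer)
        conn.setblocking(False)
        selector.register(conn, selectors.EVENT_READ)

def event_loop():
    # One thread services all peer connections; the thread count no longer
    # depends on the size of the fleet.
    while True:
        for key, _ in selector.select(timeout=1.0):
            message = key.fileobj.recv(4096)
            if message:
                apply_admin_update(message)
            else:
                selector.unregister(key.fileobj)  # peer closed the connection
                key.fileobj.close()
```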
2. The Cost of Success
The wrong condition in this case was success. Kinesis is a very successful AWS product and its usage continues to grow. As usage grows, Kinesis needs to scale up to handle that growth. At some point, the number of front-end servers was inevitably going to exceed the number of threads the OS can provide per host, and that is exactly what happened on November 25th. Amazon increased capacity by adding servers to the Kinesis front-end fleet, pushing the number of threads each server needed beyond what the OS could provide.
Reliability Engineering Concern 2 — Not monitoring or alerting on a critical shared resource (the per-host thread count). The thread count was close to the OS limit of threads per host. There are many resources to monitor, but some critical ones (e.g. CPU, Memory, IO, Storage, Bandwidth, Uptime, Processes, Threads, File handles, Paging) should always be on the list because they are common causes of resource exhaustion.
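As an illustration of what such a check might look like, here is a minimal sketch that compares a Linux host's total thread count against the kernel limit. The threshold and the alerting hook are my assumptions, not anything described in the incident report.

```python
from pathlib import Path

ALERT_THRESHOLD = 0.8  # assumption: alert when 80% of the kernel limit is in use

def page_oncall(message: str) -> None:
    """Stand-in for a real alerting integration."""
    print(f"ALERT: {message}")

def count_host_threads() -> int:
    """Sum the Threads: field of /proc/<pid>/status across all processes (Linux)."""
    total = 0
    for status in Path("/proc").glob("[0-9]*/status"):
        try:
            for line in status.read_text().splitlines():
                if line.startswith("Threads:"):
                    total += int(line.split()[1])
                    break
        except OSError:
            pass  # the process exited while we were scanning
    return total

def check_thread_headroom() -> None:
    limit = int(Path("/proc/sys/kernel/threads-max").read_text())
    used = count_host_threads()
    if used > ALERT_THRESHOLD * limit:
        page_oncall(f"host thread usage {used}/{limit} is above threshold")
```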
After the new servers were added, the existing front-end servers hit errors when trying to create the additional threads needed to connect to them. That shouldn't have been a major issue: having some servers unable to talk to a small subset of the fleet should only lead to degraded routing and admin capabilities. But part of the code path that was now erroring included building the routing map for the back-end shards. That left the map corrupt, and Kinesis as a whole started to degrade.
Reliability Engineering Concern 3 — No fallback in a critical procedure, building the shard map. You should regularly review your services to identify critical components that, should they fail, will cause a failure of the whole service. On identifying such a component, redesign it with a fallback so it can provide a degraded service rather than fail completely.
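One common way to provide that fallback is to keep serving the last known-good version of the artifact when a rebuild fails, rather than publishing a partial or corrupt one. Here is a minimal sketch of that pattern, with names of my own choosing rather than Kinesis internals:

```python
import logging

class ShardMapProvider:
    """Serve a routing map, falling back to the last known-good copy if a rebuild fails."""

    def __init__(self, build_fresh_map, validate_map):
        # build_fresh_map and validate_map are injected callables; in a real
        # service they would query the fleet and sanity-check the result.
        self._build = build_fresh_map
        self._validate = validate_map
        self._last_good = None

    def get(self):
        try:
            new_map = self._build()
            self._validate(new_map)  # reject partial or corrupt maps
            self._last_good = new_map
        except Exception:
            logging.exception("shard-map rebuild failed; serving last known-good map")
            if self._last_good is None:
                raise  # nothing safe to fall back to, so fail loudly
        return self._last_good
```

With this shape, a failed rebuild degrades routing freshness instead of corrupting it, which is exactly the "degraded rather than down" behaviour the concern calls for.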
3. The Important First Step: Getting Things Working
The primary focus of Reliability Engineering operations during an incident is to mitigate the problem, not to fix it. Identifying the underlying problem and fixing it usually consumes a lot of time and people. During the incident it's more important to find a workaround so that the system can resume service as close to normal as possible, and to analyse more deeply afterwards.
Reducing the capacity back to where it was (rolling back the change) was the obvious first step. If the routing maps had not been corrupted, that would have worked nicely. Sadly, although reducing capacity meant the servers no longer needed to create too many threads, the routing-map corruption remained. The next step was to turn the servers off and on again, so that they could restart and build a clean routing map. Unfortunately, route-map building is resource intensive and competes with request handling, which meant that only a few servers could restart at a time or the whole Kinesis system would return errors for a large percentage of requests, making the situation even worse.
Reliability Engineering Concerns 4 & 5 — Slow service startup time, and startup dependent on slow dependencies. This is a well-known concern in Reliability Engineering. When everything is working well, it's fine for services to start slowly, because other servers can handle the load while they come up. But when things go wrong, slow-starting services make the whole situation worse. In this case Amazon quickly recognized this, saw a quick solution (starting the servers on a statically defined good route map and letting them update gradually as normal), and patched the servers to do that.
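My reading of that patch is essentially a bootstrap-from-cache pattern: start serving from a known-good map stored locally, then refresh from the fleet in the background, so restart time no longer depends on the expensive rebuild. A minimal sketch under that assumption (the path and names are mine, not Amazon's):

```python
import json
import threading
from pathlib import Path

# Assumed location of a statically defined, known-good route map.
BOOTSTRAP_MAP_PATH = Path("/etc/frontend/bootstrap-shard-map.json")

def start_server(serve_with_map, refresh_map_in_background):
    """Come up quickly on the cached map; rebuild the live map gradually afterwards.

    serve_with_map and refresh_map_in_background are injected callables; in a
    real service they would start request handling and the gradual map refresh.
    """
    cached_map = json.loads(BOOTSTRAP_MAP_PATH.read_text())
    serve_with_map(cached_map)  # take traffic now, before the expensive rebuild
    threading.Thread(target=refresh_map_in_background, daemon=True).start()
```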
4. Other Products
Kinesis wasn’t the only product impacted. Several others were too, including services that shouldn’t have been impacted at all, or at worst should have provided a degraded service with their core features working and only peripheral features unavailable. Unfortunately, unexpected and unnecessary hard dependencies on Kinesis made them fail.
Reliability Engineering Concern 6 — Unintended dependency on a non-critical feature causing failure in a critical service. For example, Cognito failed: it is not supposed to depend on Kinesis to operate, and only uses Kinesis for some best-effort add-on reporting, but it turned out to have an unrealized hard dependency on Kinesis.
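The general defence is to treat non-critical dependencies as genuinely best-effort: run them off the critical path and swallow their failures so the core feature keeps working. Here is a minimal sketch of the idea, with illustrative names rather than Cognito internals:

```python
import concurrent.futures
import logging

# A small worker pool isolates best-effort work from the request path.
_reporting_pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def authenticate(request, verify_credentials, emit_usage_record):
    """Core path: authentication must succeed even if reporting is down."""
    result = verify_credentials(request)  # critical: failures here should propagate

    # Best-effort path: fire and forget. Any exception raised inside
    # emit_usage_record stays in the discarded future and never reaches the caller.
    try:
        _reporting_pool.submit(emit_usage_record, request)
    except Exception:
        logging.warning("usage reporting unavailable; continuing without it")

    return result
```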
Amazon identified one further reliability improvement they could make, one that is only really necessary for very high reliability: using a cell architecture to provide bulkheads. Most companies achieve this by having regions able to operate independently, so that if one region fails, the others continue with no issues. Amazon often recommends doing the same at the Availability Zone level (an Availability Zone is similar to a data centre).
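A minimal sketch of the cell idea: customers are deterministically assigned to one of several independent copies of the stack, so a failure in one cell affects only the customers pinned to it. The cell names and hashing scheme below are my own illustration.

```python
import hashlib

# Hypothetical independent cells, each running a full copy of the service stack.
CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]

def cell_for_account(account_id: str) -> str:
    """Deterministically pin an account to one cell so a failure stays contained."""
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]
```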
For Reliability Engineers, an outage is a learning opportunity, whether that outage occurs in your organization, or elsewhere. And it’s a lot cheaper to learn from other people’s mistakes.