Ryan Frantz
AWS Kinesis Outage Analysis

Kinesis Outage

On November 25, 2020, Amazon Web Services (AWS) experienced an outage in its Kinesis product that caused cascading failures in several downstream products. The outage is known to have impacted several well-known companies, such as Adobe and Roku, along with countless other customers. Amazon released a summary of the event providing initial details, including their observations, some technical details, and early remediation work.

I read through the summary and made several rough notes that I’ll share here. I’ve been revisiting my thoughts on Donella Meadows’ Systems Thinking in Practice, so I’ll link to relevant content about system leverage points in the notes below.

Rough Notes

Kinesis powers a number of other services like Cognito, CloudWatch, and EventBridge.

Adding capacity was a trigger.

A resource limit (thread count on frontend servers) was exceeded.
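To make the interaction between added capacity and the thread limit concrete, here is a minimal sketch. It assumes, as I understand Amazon's summary, that each frontend server keeps one operating system thread per peer in the frontend fleet. The function names, the fleet sizes, the baseline thread count, and the limit are all hypothetical numbers chosen for illustration, not values from the summary.

```python
def threads_per_server(fleet_size: int, baseline_threads: int) -> int:
    """Threads one frontend server needs: one per peer, plus a baseline.

    Assumption: one OS thread per *other* server in the fleet.
    baseline_threads stands in for everything else the process runs.
    """
    return (fleet_size - 1) + baseline_threads


def exceeds_limit(fleet_size: int, baseline_threads: int, max_threads: int) -> bool:
    """True if a server in a fleet of this size would exceed the thread limit."""
    return threads_per_server(fleet_size, baseline_threads) > max_threads


def max_safe_fleet_size(baseline_threads: int, max_threads: int) -> int:
    """Largest fleet size that keeps every server at or under the limit."""
    return max_threads - baseline_threads + 1


# Hypothetical values, for illustration only.
MAX_THREADS = 10_000
BASELINE = 500

for fleet_size in (9_400, 9_600):
    needed = threads_per_server(fleet_size, BASELINE)
    status = "over limit" if exceeds_limit(fleet_size, BASELINE, MAX_THREADS) else "ok"
    print(f"fleet={fleet_size} threads/server={needed} {status}")
```

The point of the sketch is that the per-server thread count grows with the size of the whole fleet, so adding capacity pushes every server toward the limit at once rather than just the new ones.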

A number of immediate and forthcoming remediation items have been defined. Several architectural changes will be introduced, which may themselves trigger future outages or surface other limits.

Kinesis Dependencies

Based on the above notes, here’s a rough diagram of the services that have immediate or secondary (?) dependencies on Kinesis:

AWS Kinesis Dependencies
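The same dependencies can be roughed out as a small graph, which makes it easy to ask which services sit one hop from Kinesis and which sit two. In this sketch, only the three direct edges (Cognito, CloudWatch, EventBridge) come from the notes above; `ServiceA` and `ServiceB` are hypothetical placeholders for secondary dependents, and the function name is my own.

```python
from collections import deque

# service -> services it depends on
DEPENDS_ON = {
    "Cognito": ["Kinesis"],
    "CloudWatch": ["Kinesis"],
    "EventBridge": ["Kinesis"],
    # Hypothetical secondary dependents, for illustration only.
    "ServiceA": ["CloudWatch"],
    "ServiceB": ["EventBridge"],
}


def dependents_by_depth(graph: dict[str, list[str]], root: str) -> dict[str, int]:
    """Return every service that depends on root, with its hop distance.

    Depth 1 is an immediate dependency; depth 2 or more is secondary.
    """
    # Invert the graph: dependency -> services that rely on it.
    reverse: dict[str, list[str]] = {}
    for service, deps in graph.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(service)

    # Breadth-first search outward from the failed service.
    depth: dict[str, int] = {}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, hops = queue.popleft()
        for dependent in reverse.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                depth[dependent] = hops + 1
                queue.append((dependent, hops + 1))
    return depth


for service, hops in sorted(dependents_by_depth(DEPENDS_ON, "Kinesis").items()):
    kind = "immediate" if hops == 1 else "secondary"
    print(f"{service}: {kind} (depth {hops})")
```

Swapping real service names in for the placeholders would turn this into a rough blast-radius check for any single failing dependency.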