Amazon Builders' Library in focus #4: Avoiding insurmountable queue backlogs
In the latest article in our series focusing on the Amazon Builders’ Library, Yan Cui highlights the key insights from Avoiding insurmountable queue backlogs by AWS Principal Engineer David Yanacek.
About the Amazon Builders’ Library
The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.
Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.
Avoiding insurmountable queue backlogs
How we measure availability and latency
For Amazon SQS, the number of messages going into the dead-letter queue (DLQ) is a good measure of availability. Similarly, we can use message age as a measure of latency.
Backlogs in multi-tenant systems
When implementing multi-tenant systems, you need to add fairness throttling. No customer should be able to monopolise the available resources and affect other customers’ workloads.
Amazon’s strategies for building multi-tenant systems
Separating workloads into dedicated queues – e.g. one queue per customer
Shuffle-sharding – e.g. AWS Lambda has a fixed number of queues and hashes each customer to a small subset of them
Sidelining excess traffic to a separate queue – e.g. move excess traffic from a customer to a spillover queue for later processing
Sidelining old traffic to a separate queue – callers might have given up on those old messages, so it’s better to focus on fresh messages
Dropping old messages by specifying message time-to-live
Limiting threads (and other resources) per workload
Sending backpressure upstream – a.k.a. load shedding; this is not always easy to do in a multi-tenant queue, or even appropriate for the application (e.g. in an order-processing system it’s better to accept a backlog than to drop new orders)
Using delay queues to put off work until later – move the workload into a surge queue with a message delay so we can focus on fresh messages
Avoiding too many in-flight messages – when dealing with overloads, prefer moving excess traffic to a separate queue instead
Using DLQs for messages that can’t be processed
Ensuring an additional buffer in polling threads per workload – leave headroom in the number of pollers for spikes, and measure the number of empty receives
Heartbeating long-running messages – when the system is overloaded, latencies go up and in-flight messages can become visible again once their visibility timeout expires, so the same work gets picked up a second time; without a heartbeat that extends the visibility timeout while a message is still being processed, we will essentially fork-bomb ourselves
Planning for cross-host debugging – X-Ray, correlation IDs, etc., or use Step Functions for complicated async workflows
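The shuffle-sharding strategy above can be sketched in a few lines. This is a minimal illustration of the idea, not Lambda's actual implementation; the queue count, shard size, and helper name are all assumptions.

```python
import hashlib
import random


def shuffle_shard(customer_id: str, num_queues: int = 16, shard_size: int = 2) -> list[int]:
    """Deterministically map a customer to a small, fixed subset of queues.

    Seeding a PRNG from a hash of the customer ID means each customer always
    lands on the same shard, while two customers rarely share *all* of their
    queues - so one noisy customer can't back up the whole fleet.
    """
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_queues), shard_size))
```

A worker for queue `i` then only processes messages routed there, so a backlog from one customer is contained to that customer's shard.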
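Limiting threads per workload, also listed above, amounts to a per-tenant concurrency cap. A minimal sketch using a bounded semaphore per workload (the class and its API are my own illustration, not from the article):

```python
import threading
from collections import defaultdict


class WorkloadLimiter:
    """Cap concurrent handlers per workload so one busy customer
    cannot consume every processing thread in the fleet."""

    def __init__(self, per_workload_limit: int):
        self._lock = threading.Lock()
        self._semaphores = defaultdict(
            lambda: threading.BoundedSemaphore(per_workload_limit)
        )

    def try_acquire(self, workload: str) -> bool:
        """Return True if this workload is under its limit; never blocks."""
        with self._lock:
            sem = self._semaphores[workload]
        return sem.acquire(blocking=False)

    def release(self, workload: str) -> None:
        self._semaphores[workload].release()
```

A poller would call `try_acquire` before handing a message to a worker, and skip (or sideline) the message when the workload is already at its cap.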