In the latest article in our series focusing on the Amazon Builders’ Library, Yan Cui highlights the key insights from Avoiding insurmountable queue backlogs by AWS Principal Engineer David Yanacek.
About the Amazon Builders’ Library
The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.
Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.
Avoiding insurmountable queue backlogs
How we measure availability and latency
For SQS, the number of messages going into the dead-letter queue (DLQ) is a good measure of availability. Similarly, we can use message age to measure latency.
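As a rough illustration of those two measurements, the sketch below derives message age from the `SentTimestamp` attribute that SQS attaches to each message (epoch milliseconds, delivered as a string) and an availability ratio from DLQ counts. The function names and the ratio itself are assumptions made for this example, not something the original article prescribes.

```python
import time
from typing import Optional


def message_age_seconds(sent_timestamp_ms: str, now: Optional[float] = None) -> float:
    """Age of a message in seconds.

    sent_timestamp_ms is the SQS SentTimestamp attribute: epoch milliseconds,
    delivered as a string. `now` (epoch seconds) is injectable for testing.
    """
    current = time.time() if now is None else now
    # Clamp at zero so small clock skew never yields a negative age.
    return max(0.0, current - int(sent_timestamp_ms) / 1000.0)


def availability(received: int, dead_lettered: int) -> float:
    """Fraction of received messages that did NOT end up in the DLQ.

    This ratio is an assumed definition for illustration only.
    """
    if received == 0:
        return 1.0  # nothing received, so nothing failed
    return 1.0 - dead_lettered / received
```

In practice the age of the oldest message is usually the number to alarm on, since it shows how far behind the consumers have fallen.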
Backlogs in multi-tenant systems
When implementing multi-tenant systems, you need to add fairness throttling. No customer should be able to monopolise the available resources and affect other customers’ workloads.
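One common way to implement fairness throttling is a token bucket per customer. The sketch below is a minimal in-memory version under that assumption. The class names, the `rate` and `capacity` parameters and the explicit `now` argument are all hypothetical choices for this example, and the original article does not say this is how Amazon implements it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TokenBucket:
    rate: float        # tokens refilled per second
    capacity: float    # maximum burst size
    tokens: float      # tokens currently available
    updated_at: float  # time of the last refill, in seconds

    def try_acquire(self, now: float) -> bool:
        # Refill in proportion to the elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class FairnessThrottle:
    """Keeps one bucket per customer so no single customer can use it all."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: Dict[str, TokenBucket] = {}

    def admit(self, customer_id: str, now: float) -> bool:
        bucket = self.buckets.get(customer_id)
        if bucket is None:
            # New customers start with a full bucket.
            bucket = TokenBucket(self.rate, self.capacity, self.capacity, now)
            self.buckets[customer_id] = bucket
        return bucket.try_acquire(now)
```

A request that is not admitted does not have to be rejected outright. It could instead be sidelined to a spillover queue, as described in the strategies that follow.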
Amazon’s strategies for building multi-tenant systems
- Separating workloads into separate queues – e.g. one queue per customer
- Shuffle-sharding – e.g. Lambda has a fixed number of queues, and hashes each customer to a small subset of them
- Sidelining excess traffic to a separate queue – e.g. move excess traffic from a customer to a spillover queue for later processing
- Sidelining old traffic to a separate queue – callers might have given up on those old messages, so it’s better to focus on fresh messages
- Dropping old messages by specifying message time-to-live
- Limiting threads (and other resources) per workload
- Sending backpressure upstream – also known as load shedding. This is not always easy to do in a multi-tenant queue, or even appropriate for the application (e.g. in order-processing systems, it is better to accept a backlog than to drop new orders)
- Using delay queues to put off work until later – move workload into a surge queue with message delay so we can focus on fresh messages
- Avoiding too many in-flight messages – when dealing with overloads, prefer moving excess traffic to a separate queue instead
- Using DLQs for messages that can’t be processed
- Ensuring additional buffer in polling threads per workload – leave headroom in the number of pollers for spikes, and measure the number of empty receives
- Heartbeating long-running messages – when the system is overloaded, latencies tend to go up, and a message that is still being processed can become visible again once its visibility timeout expires. Another consumer then picks it up and the work is duplicated, so we essentially fork-bomb ourselves. Periodically extending the visibility timeout (heartbeating) while work is in progress prevents this
- Planning for cross-host debugging – X-Ray, correlation IDs, etc., or use Step Functions for complicated async workflows
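Of the strategies above, shuffle-sharding is perhaps the least obvious, so here is a minimal sketch of the idea: each customer is deterministically mapped to a small subset of a fixed pool of queues. The function names, the SHA-256 seeding and the "enqueue into the shortest queue in the shard" policy are assumptions for illustration and are not taken from Lambda's actual implementation.

```python
import hashlib
import random
from typing import List, Sequence


def shard_for_customer(customer_id: str, num_queues: int, shard_size: int) -> List[int]:
    """Deterministically map a customer to `shard_size` of `num_queues` queues."""
    if not 0 < shard_size <= num_queues:
        raise ValueError("shard_size must be between 1 and num_queues")
    # Derive a stable seed from the customer ID so every host computes the
    # same shard without any coordination.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(range(num_queues), shard_size))


def pick_queue(customer_id: str, queue_depths: Sequence[int], shard_size: int) -> int:
    """Choose the least-loaded queue within the customer's shard."""
    shard = shard_for_customer(customer_id, len(queue_depths), shard_size)
    return min(shard, key=lambda q: queue_depths[q])
```

Because two customers rarely share exactly the same subset, a noisy customer that fills its own queues leaves most other customers with at least one uncongested queue to fall back on.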