Amazon Builders' Library in focus #4: Avoiding insurmountable queue backlogs

Jan 16 2020


In the latest article in our series focusing on the Amazon Builders’ Library, Yan Cui highlights the key insights from Avoiding insurmountable queue backlogs by AWS Principal Engineer David Yanacek.

About the Amazon Builders’ Library

The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.

Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.

Avoiding insurmountable queue backlogs

How we measure availability and latency

For SQS, the number of messages going into the dead-letter queue (DLQ) is a good measure of availability. Similarly, we can use message age (how long a message has waited since it was sent) to measure latency.
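As a rough illustration of the latency measure, the sketch below computes message age from a send time in epoch milliseconds, which is the format of the SQS `SentTimestamp` message attribute. The function names are hypothetical, and in practice the CloudWatch metric `ApproximateAgeOfOldestMessage` may be all you need.

```python
import time


def message_age_seconds(sent_timestamp_ms, now_ms=None):
    """Return how long a message has been waiting, in seconds.

    sent_timestamp_ms is the send time in epoch milliseconds.
    now_ms can be injected for testing; it defaults to the current time.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero so that small clock skew never yields a negative age.
    return max(0.0, (now_ms - sent_timestamp_ms) / 1000.0)


def oldest_message_age_seconds(sent_timestamps_ms, now_ms=None):
    """Return the age of the oldest message in a batch, or 0.0 if it is empty."""
    if not sent_timestamps_ms:
        return 0.0
    return message_age_seconds(min(sent_timestamps_ms), now_ms)
```

Alarming on the age of the oldest message, rather than on queue depth alone, tracks the delay that callers actually experience.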

Backlogs in multi-tenant systems

When implementing multi-tenant systems, you need to add fairness throttling. No single customer should be able to monopolise the available resources and affect other customers’ workloads.
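One common way to implement fairness throttling is a token bucket per customer. The sketch below is a minimal single-process version under that assumption; the `TenantThrottle` class and its parameters are hypothetical and are not taken from the original article.

```python
import time


class TenantThrottle:
    """Per-tenant token bucket: each tenant gets its own independent budget."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens added per second, per tenant
        self.burst = burst         # maximum tokens a tenant can accumulate
        self.clock = clock         # injectable clock, useful for testing
        self._buckets = {}         # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id):
        """Return True if the tenant may enqueue one more unit of work."""
        now = self.clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(tenant_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[tenant_id] = (tokens - 1, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False
```

Because every customer draws from a separate bucket, a customer that exhausts its budget is throttled without reducing what the others are allowed to send.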

Amazon’s strategies for building multi-tenant systems

Separating workloads into separate queues – e.g. one queue per customer
Shuffle-sharding – e.g. Lambda has a fixed number of queues, and hashes each customer to a small subset of them
Sidelining excess traffic to a separate queue – e.g. move excess traffic from a customer to a spillover queue for later processing
Sidelining old traffic to a separate queue – callers might have given up on those old messages, so it’s better to focus on fresh messages
Dropping old messages by specifying message time-to-live
Limiting threads (and other resources) per workload
Sending backpressure upstream – a form of load shedding, but this is not always easy to do in a multi-tenant queue, or even appropriate for the application (e.g. in order-processing systems, it is better to accept a backlog than to drop new orders)
Using delay queues to put off work until later – move workload into a surge queue with message delay so we can focus on fresh messages
Avoiding too many in-flight messages – when dealing with overloads, prefer moving excess traffic to a separate queue rather than keeping it in flight
Using DLQs for messages that can’t be processed
Ensuring additional buffer in polling threads per workload – leave headroom in the number of pollers to absorb spikes, and measure the number of empty receives as a signal of spare polling capacity
Heartbeating long-running messages – when the system is overloaded, latencies tend to go up and messages can become visible again once the visibility timeout expires, even though they are still being processed. When that happens other consumers pick them up as well, and we essentially fork-bomb ourselves
Planning for cross-host debugging – use X-Ray, correlation IDs, etc., or use Step Functions for complicated async workflows
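Two of the strategies above, shuffle-sharding and sidelining old traffic, can be sketched in a few lines. This is an illustration under stated assumptions, not how Lambda or any AWS service implements them: the function names, the SHA-256 ranking scheme, the choice of the shortest queue within a shard, and the age threshold are all hypothetical.

```python
import hashlib


def shuffle_shard(customer_id, num_queues, shard_size):
    """Deterministically map a customer to shard_size distinct queue indexes."""
    if not 0 < shard_size <= num_queues:
        raise ValueError("shard_size must be between 1 and num_queues")

    # Rank every queue by a hash of (customer, queue) and keep the lowest
    # ranks, so the same customer always lands on the same subset.
    def rank(queue_index):
        return hashlib.sha256(f"{customer_id}:{queue_index}".encode()).digest()

    return sorted(sorted(range(num_queues), key=rank)[:shard_size])


def choose_queue(customer_id, queue_depths, shard_size):
    """Pick the least-loaded queue within the customer's shard (an assumption)."""
    shard = shuffle_shard(customer_id, len(queue_depths), shard_size)
    return min(shard, key=lambda q: queue_depths[q])


def route_by_age(age_seconds, max_fresh_age_seconds):
    """Sideline messages older than the threshold so fresh ones go first."""
    return "main" if age_seconds <= max_fresh_age_seconds else "sideline"
```

With a small shard size relative to the number of queues, two customers rarely share their whole shard, so one customer's backlog usually leaves at least one uncongested queue for everyone else.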

Read parts 1-3 of the Amazon Builders’ Library in Focus series:


Amazon Builders’ Library in focus #3: Avoiding fallback in distributed systems

Amazon Builders’ Library in focus #2: Using load shedding to avoid overload

Amazon Builders’ Library in focus #1: Timeouts, retries, and backoff with jitter
