In the first article in our new series, Yan Cui highlights the key insights from the Amazon Builders’ Library article, Timeouts, retries and backoff with jitter, by AWS Senior Principal Engineer Marc Brooker.
Thanks to Ryan Scott Brown of Trek10 for contributing towards this summary.
The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.
Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.
Set a timeout on every remote call, and on any call across processes on the same server. This covers both the connection timeout and the request timeout. However, picking the right timeout value is difficult, and setting it too low can trigger a retry storm.
Amazon chooses an acceptable rate of false timeouts (e.g. 0.1%) and uses the corresponding latency percentile of the downstream service as the timeout.
YC: for Lambda, we can find out the amount of time left in the current invocation with context.getRemainingTimeInMillis(). We can use this info to set a dynamic timeout value on these remote/cross-process calls.
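As a minimal sketch of that idea in Python (where the Lambda context exposes the same information as get_remaining_time_in_millis), with an illustrative 500 ms buffer and cap that are my own choices, not from the article:

```python
def dynamic_timeout_seconds(context, buffer_ms=500, max_timeout_ms=3000):
    """Derive a timeout for a downstream call from the time left in the
    current Lambda invocation, leaving a buffer so the function can still
    handle the failure and return a meaningful error."""
    remaining_ms = context.get_remaining_time_in_millis()
    # Never go below zero, and never exceed the per-call cap.
    timeout_ms = min(max(remaining_ms - buffer_ms, 0), max_timeout_ms)
    return timeout_ms / 1000.0
```

You could then pass this to an HTTP client, e.g. `requests.get(url, timeout=dynamic_timeout_seconds(context))`.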
Also, where possible, instead of building their own timeout mechanism, Amazon prefers to use timeout mechanisms that are built into well-tested clients (e.g. HTTP clients).
Use exponential backoff between retries, but cap the backoff at a maximum value to avoid waiting too long, aka “capped exponential backoff”. Imagine allowing 10 retries with exponential backoff: the final delay alone would be 2**10 seconds, about 17 minutes, if you don’t cap the backoff at some maximum value.
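Capped exponential backoff fits in one line; the base and cap values here are illustrative, not from the article:

```python
def capped_backoff(attempt, base=0.1, cap=5.0):
    """Exponential backoff delay in seconds, capped at `cap`.
    attempt 0 -> base, attempt 1 -> 2*base, ... never exceeding cap."""
    return min(cap, base * 2 ** attempt)
```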
In a system with many layers, retrying at all layers might not be desirable as it multiplies the number of retries. In general, for low-cost control-plane and data-plane operations, Amazon’s best practice is to retry at a single point in the stack.
Use circuit breakers to give failing systems a chance to recover. However, circuit breakers make testing more difficult and can add time to recovery. A mitigating strategy is to use a token bucket, which has been built into the AWS SDK since 2016.
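To illustrate the token-bucket idea (this is a sketch of the concept, not the SDK’s actual implementation; the class name and parameter values are my own):

```python
class RetryTokenBucket:
    """Retries spend tokens; successes slowly refill them. When the
    bucket runs dry, retries are skipped entirely, which throttles
    retry traffic and gives the downstream a chance to recover."""

    def __init__(self, capacity=10, retry_cost=1, refill_per_success=0.5):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.retry_cost = retry_cost
        self.refill_per_success = refill_per_success

    def can_retry(self):
        # Spend tokens only if enough are available.
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False

    def record_success(self):
        # Refill gradually, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + self.refill_per_success)
```

Unlike a circuit breaker, this degrades gradually rather than flipping between open and closed states, which is what makes it easier to reason about and test.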
APIs with side effects aren’t safe to retry unless they provide idempotency. For example, if you were to retry billing a customer without proper idempotency you could bill them several times accidentally.
Know which errors are worth retrying: HTTP client errors (4XX) generally aren’t, because the same request will fail the same way again.
To avoid all clients retrying at the same time, inject jitter into the backoff. Marc Brooker’s favourite is:
randint(0, base * 2 ** attempt)
See https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ for a detailed explanation, or watch this section of the talk about backoff, retries, and jitter from re:Invent 2019.
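That “full jitter” scheme can be sketched as follows, combined with a cap as discussed earlier (base and cap values are illustrative):

```python
import random

def full_jitter_backoff(attempt, base=0.1, cap=5.0):
    """'Full jitter': sleep for a uniformly random duration between 0 and
    the capped exponential delay, so retries from many clients spread out
    instead of arriving in synchronized waves."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```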
Jitter is not just for retries: consider adding jitter to all timers, periodic jobs, and delayed work. No scheduled job needs to run exactly at the top of the hour, but most humans will choose a time like “4am on Wednesday” anyway, causing scheduled jobs to cluster around the start of the hour.
When adding jitter to scheduled jobs, use a consistent method to derive the jitter on each host, so that when the system is overloaded there is an easy-to-spot pattern in its behavior.
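One way to get consistent per-host jitter is to derive it from a stable host identifier (a hypothetical helper, not from the article; the one-hour window is illustrative):

```python
import hashlib

def host_jitter_seconds(hostname, window_seconds=3600):
    """Deterministic per-host offset within the scheduling window: the
    same host always gets the same jitter (so its behavior is predictable
    and debuggable), while different hosts spread across the window."""
    digest = hashlib.sha256(hostname.encode()).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds
```

A host would then run its hourly job at `window_start + host_jitter_seconds(hostname)` rather than at the top of the hour.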
Read part 2 of the Amazon Builders’ Library in Focus series – ‘Using load shedding to avoid overload’ here.