Yan Cui
Jan 09 2020
In the third of our series of articles, Yan Cui highlights the key insights from the Amazon Builders’ Library article, Avoiding fallback in distributed systems, by AWS Senior Principal Engineer Jacob Gabrielson.
About the Amazon Builders’ Library
The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.
Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.
Avoiding fallback in distributed systems
There are four broad categories of strategies for handling critical failures:
- Retry: perform the failed activity again, either immediately or after some delay.
- Proactive retry: perform the activity multiple times in parallel, and make use of the first one to finish.
- Failover: perform the activity again against a different copy of the endpoint, or, preferably, perform multiple, parallel copies of the activity to raise the odds of at least one of them succeeding (possibly similar to the approach in Jeff Dean’s work on rapid response times).
- Fallback: use a different mechanism to achieve the same result.
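To make the first two strategies concrete, here is a minimal Python sketch. It is illustrative only and is not taken from the original article: the function names `retry` and `proactive_retry`, their parameters, and the use of a thread pool are all assumptions chosen for the example.

```python
import concurrent.futures
import time


def retry(operation, attempts=3, delay_seconds=0.0):
    """Retry: run the operation again after a failure, optionally after a delay.

    Hypothetical helper; assumes attempts >= 1.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise last_error


def proactive_retry(operation, copies=2):
    """Proactive retry: run several copies in parallel, use the first success.

    Hypothetical helper; assumes the operation is safe to run more than once.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=copies)
    futures = [pool.submit(operation) for _ in range(copies)]
    try:
        last_error = None
        for future in concurrent.futures.as_completed(futures):
            try:
                return future.result()
            except Exception as error:
                last_error = error
        # Every copy failed, so surface the last error seen.
        raise last_error
    finally:
        # Do not block on copies that are still running.
        pool.shutdown(wait=False)
```

Note that `retry` only adds load when something has already failed, whereas `proactive_retry` sends the redundant requests every time, so its load is the same on good days and bad days.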
Why Amazon doesn’t use fallbacks:
- Fallbacks are hard to test.
- Fallbacks can fail too. AWS found that investing engineering time in making the primary code more reliable usually raises the odds of success more than investing in an infrequently used fallback strategy.
- Fallbacks are often not worth the risk. They make undesirable trade-offs against other qualities such as performance; otherwise, we’d use them all the time instead of the primary. Why use a fallback (that’s worse) when something is already going wrong?
- Fallbacks can add unpredictable load as a result of making these undesirable trade-offs.
- Fallback logic can harbour latent bugs because it can go years without being triggered in production.
How Amazon avoids fallbacks:
- Improve the reliability of non-fallback cases (e.g. using services with inherently better resilience).
- Let the caller handle errors (e.g. CLI & SDK have built-in retry and backoff).
- Push data proactively (i.e. push, not pull).
- Convert fallback into failover (randomly choosing between fallback and non-fallback paths so they’re not really “fallback” per se).
- Ensure that retries and timeouts don’t become fallback (e.g. by doing proactive retry, that is, always make redundant requests so there is no extra load when retry happens).
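As a rough illustration of converting a fallback into a failover, the sketch below picks one of two equivalent paths at random on every call, so both are exercised continuously in production. The name `call_with_failover`, its parameters, and the choice to try the other path when the first one fails are assumptions for this example, not details from the original article.

```python
import random


def call_with_failover(path_a, path_b, rng=random):
    """Pick one of two equivalent paths at random; if it fails, try the other.

    Hypothetical helper: because each path handles roughly half of normal
    traffic, neither one is a rarely used "fallback".
    """
    if rng.random() < 0.5:
        first, second = path_a, path_b
    else:
        first, second = path_b, path_a
    try:
        return first()
    except Exception:  # a real client would catch narrower errors
        return second()
```

For this to count as failover rather than fallback, both paths would need to be equally acceptable in terms of performance and correctness, which this sketch simply assumes.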
Read parts 1 & 2 of the Amazon Builders’ Library in Focus series: