Amazon Builders’ Library in focus #3: Avoiding fallback in distributed systems

Jan 09 2020

In the third of our series of articles, Yan Cui highlights the key insights from the Amazon Builders’ Library article, Avoiding fallback in distributed systems, by AWS Senior Principal Engineer Jacob Gabrielson.

About the Amazon Builders’ Library

The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.

Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.

Avoiding fallback in distributed systems

There are four broad categories of strategies for handling critical failures:

Retry: perform the failed activity again, either immediately or after some delay.
Proactive retry: perform the activity multiple times in parallel, and make use of the first one to finish.
Failover: perform the activity again against a different copy of the endpoint, or, preferably, perform multiple, parallel copies of the activity to raise the odds of at least one of them succeeding. [similar to Jeff Dean’s paper on rapid response time?]
Fallback: use a different mechanism to achieve the same result.
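
To make the first two strategies concrete, here is a minimal Python sketch. It is only an illustration and not code from the original article: the names retry and proactive_retry, the number of attempts and the jittered exponential delay are all assumptions chosen for the example.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry: run `operation` again after a randomized, growing delay.

    Hypothetical helper; the defaults are illustrative, not recommendations.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait a random time between 0 and base_delay * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


def proactive_retry(operation, copies=2):
    """Proactive retry: start `copies` (>= 1) calls at once, use the first success.

    Hypothetical helper; assumes `operation` is safe to run more than once.
    """
    with ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [pool.submit(operation) for _ in range(copies)]
        last_error = None
        for future in as_completed(futures):
            try:
                # First copy to finish successfully wins. Leaving the `with`
                # block still waits for the remaining copies to complete.
                return future.result()
            except Exception as error:
                last_error = error
        raise last_error  # every copy failed
```

Note that proactive_retry always sends the redundant requests, so the downstream service sees the same load whether or not anything is failing.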

Why Amazon doesn’t use fallbacks:

Fallbacks are hard to test.
Fallbacks can fail too. AWS found that investing engineering time in making the primary code more reliable usually raises the odds of success more than investing in an infrequently used fallback strategy.
Fallbacks are often not worth the risk. They make undesirable trade-offs against other qualities such as performance; otherwise, we’d use them all the time instead of the primary. Why use a fallback (that’s worse) when something is already going wrong?
Fallbacks can add unpredictable load as a result of making these undesirable trade-offs.
Fallback logic can harbor latent bugs because it can go years without being triggered in production.

How Amazon avoids fallbacks:

Improve the reliability of non-fallback cases (e.g. using services with inherently better resilience).
Let the caller handle errors (e.g. CLI & SDK have built-in retry and backoff).
Push data proactively (i.e. push, not pull).
Convert fallback into failover (randomly choosing between fallback and non-fallback paths so they’re not really “fallback” per se).
Ensure that retries and timeouts don’t become fallback (e.g. by doing proactive retry, that is, always making redundant requests so there is no extra load when a retry happens).
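
One way to read the "convert fallback into failover" point is sketched below in Python. This is an assumption about how the idea could look in code, not Amazon's implementation: call_with_failover is a hypothetical helper, and each path is assumed to be a plain callable that either returns a result or raises.

```python
import random


def call_with_failover(paths, rng=random):
    """Try the given callables in random order and return the first success.

    Hypothetical helper: because the order is shuffled on every call, no path
    is a rarely exercised "fallback"; each one regularly serves real traffic.
    """
    order = list(paths)
    rng.shuffle(order)
    last_error = None
    for path in order:
        try:
            return path()
        except Exception as error:
            last_error = error  # remember the failure, move to the next path
    if last_error is None:
        raise ValueError("no paths given")
    raise last_error  # every path failed
```

Since every path handles production requests all the time, a bug in any of them shows up quickly instead of sitting dormant until the primary fails.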
