Jan 30 2020
In the latest instalment of our series on the Amazon Builders’ Library, Yan Cui highlights the main takeaways from Implementing health checks, by AWS Principal Engineer David Yanacek.
About the Amazon Builders’ Library
The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.
Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail, check out the original article.
Implementing health checks
Control plane = making changes to a system (e.g. adding resources) and propagating those changes
Data plane = the daily business of those resources – what it takes for them to function
Health check tradeoffs
Many load balancers use a “least requests” routing algorithm. An unhealthy server that fails requests quickly can attract more requests than healthy servers. You can prevent this “black hole” effect by slowing down failed requests to match the average latency of successful requests.
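The “slow failures down” mitigation is simple enough to sketch. Below is a minimal, hypothetical Python example; the class name, the starting latency estimate, and the smoothing weight are my own assumptions, not from the original article.

```python
import time

class FailureDelayer:
    """Pads fast failures so they cost as much wall-clock time as an
    average successful request."""

    def __init__(self):
        self.avg_success_latency = 0.1  # seconds; an assumed starting estimate
        self.alpha = 0.05               # weight for the exponential moving average

    def record_success(self, latency):
        # Keep a rolling (exponentially weighted) average of successful latency.
        self.avg_success_latency = (
            (1 - self.alpha) * self.avg_success_latency + self.alpha * latency
        )

    def handle(self, request, do_work):
        start = time.monotonic()
        try:
            response = do_work(request)
            self.record_success(time.monotonic() - start)
            return response
        except Exception:
            # Slow the failure down so this server doesn't look "fast" to a
            # least-requests load balancer and attract even more traffic.
            elapsed = time.monotonic() - start
            time.sleep(max(0.0, self.avg_success_latency - elapsed))
            raise
```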
Servers can fail independently (corrupt disk, memory leak, etc.) or collectively (e.g. outage to shared dependencies or network issues). Health checks that fail for non-critical reasons can be dangerous – if the non-critical failure is correlated across servers (e.g. shared dependency) then it can kill the entire fleet.
There’s tension between a thorough health check that quickly mitigates single-server failures and the harm of a false positive that affects the whole fleet. In general, automation around health checks should stop traffic to a single bad server but keep serving traffic if the entire fleet appears to have trouble. YC: this might seem counter-intuitive, but if the whole fleet is unhealthy then you’re in trouble anyway, might as well keep going and hope it’s a false positive, right?
Ways to measure health
- Liveness checks – basic connectivity check or check if a process is running, e.g. port 80 returns HTTP 200.
- Local health checks – checks if the application is able to function and tests local resources, e.g. disk space, NGINX process, and monitoring daemons.
- Dependency health checks – checks if the application is able to interact with its dependencies (see the sketch after this list). Ideally, these catch only local problems such as expired credentials, but they can also report false positives when there’s a problem with the dependency itself.
- Anomaly detection – checks if any server is behaving oddly compared to the rest of the fleet – error rates, latency, etc. For anomaly detection to work in practice:
  - Servers should be doing approximately the same thing.
  - The fleet should be relatively homogeneous (same instance type, etc.).
  - Errors or differences in behavior must be reported – the client of a service is a great place to add instrumentation, and load balancer logs are also useful here.
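To make the first three check types concrete, here is a hedged Python sketch that puts them behind one small HTTP server. The paths, the 1 GiB disk threshold, and the payments-db.internal dependency are all invented for illustration.

```python
# Hypothetical sketch: liveness, local, and dependency checks on separate
# endpoints, so the load balancer can poll the cheap local checks while an
# external monitoring system polls the riskier dependency check.
import shutil
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

def local_health_ok():
    # Local health: is there enough free disk space for the app to function?
    return shutil.disk_usage("/").free > 1 * 1024 ** 3  # at least 1 GiB free

def dependency_health_ok():
    # Dependency health: can we reach a (made-up) downstream datastore quickly?
    try:
        with socket.create_connection(("payments-db.internal", 5432), timeout=1):
            return True
    except OSError:
        return False

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            # Liveness: the process is up and able to answer HTTP at all.
            self._respond(200, b"ok")
        elif self.path == "/health/local":
            self._respond(200 if local_health_ok() else 503, b"local")
        elif self.path == "/health/deps":
            self._respond(200 if dependency_health_ok() else 503, b"deps")
        else:
            self._respond(404, b"not found")

    def _respond(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Keeping the dependency check on its own path is what lets you point the load balancer at the safer local checks while an external monitor watches the dependency check, as discussed in the next section.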
Reacting safely to health check failures
- Fail open – NLB fails open if no servers are reporting healthy (as discussed above). ALB and Route 53 also support this behavior. While Amazon does use this approach, they’re generally skeptical of things they can’t reason about or test fully. And fail-open is a bit of a cop-out when you can’t tell the difference between a fleet-wide failure and a false positive due to problems with a shared dependency.
- Health checks without a circuit breaker – where there is no built-in circuit breaker, Amazon’s best practice for setting up health checks is:
- Use liveness and local health checks on the load balancer.
- Use an external monitoring system to perform dependency health checks and anomaly detection. Set up thresholds to stop the automated system from taking drastic actions (like killing the whole fleet) and engage human operators when the thresholds are crossed.
- Prioritize your health – as discussed in the load-shedding post, servers should prioritize health checks over regular work in overload conditions to avoid being marked unhealthy (which would make a bad situation even worse). See the sketch below.
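As a rough illustration of that last point, the sketch below always answers a (hypothetical) /health path, while regular requests are shed with a 503 once a made-up in-flight limit is exceeded. The limit, the paths, and the threading model are my assumptions, not something prescribed by the article.

```python
# Hypothetical sketch of prioritizing health checks over regular work: under
# overload, shed business requests but keep answering the health check so the
# load balancer doesn't also mark this server unhealthy.
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_IN_FLIGHT = 100  # assumed load-shedding threshold
in_flight = 0
lock = threading.Lock()

class PrioritizingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global in_flight
        if self.path == "/health":
            # Health checks are always served, regardless of load.
            self._respond(200, b"healthy")
            return

        with lock:
            if in_flight >= MAX_IN_FLIGHT:
                # Load shedding: reject excess work instead of queueing it.
                self._respond(503, b"overloaded, try again later")
                return
            in_flight += 1
        try:
            self._respond(200, self.handle_business_request())
        finally:
            with lock:
                in_flight -= 1

    def handle_business_request(self):
        return b"real work result"  # placeholder for the actual request handling

    def _respond(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), PrioritizingHandler).serve_forever()
```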
Balancing dependency health checks with the scope of impact
A “soft dependency” is a dependency that you call only sometimes. Without fail-open, a health check that tests the health of soft dependencies turns them into “hard dependencies”. If the dependency is down, the service is down, creating cascading failures.
It’s rarely a clear-cut decision as to which dependencies to health-check, but in general, you should prioritize the availability of data-plane operations. You can also use read caches to preserve uptime for read operations even when the datastore is down, as sketched below.
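A read cache that serves stale data when the datastore is unreachable might look roughly like this; fetch_from_datastore, the TTL, and the in-memory cache shape are all hypothetical.

```python
# Hypothetical sketch: a read-through cache that falls back to stale data when
# the datastore is down, trading freshness for read availability.
import time

class StaleOnErrorCache:
    def __init__(self, fetch_from_datastore, ttl_seconds=60):
        self.fetch = fetch_from_datastore  # assumed function: key -> value
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, fetched_at)

    def get(self, key):
        cached = self.entries.get(key)
        if cached and time.monotonic() - cached[1] < self.ttl:
            return cached[0]  # fresh enough, skip the datastore entirely
        try:
            value = self.fetch(key)
            self.entries[key] = (value, time.monotonic())
            return value
        except Exception:
            if cached:
                # Datastore is down: prefer stale data over a failed read.
                return cached[0]
            raise  # nothing cached, so the failure has to surface

# Example usage (user_store.load is a made-up datastore call):
# cache = StaleOnErrorCache(user_store.load, ttl_seconds=30)
# profile = cache.get("user-123")
```

The trade-off is explicit: during a datastore outage, reads keep working but may return stale data, which is usually better than returning errors.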
Real things that have gone wrong with health checks
This section of the original article describes a number of real-world failures at Amazon. I recommend reading the whole section even if you have skipped the earlier ones.
Read parts 1-5 of the Amazon Builders’ Library in Focus series: