In our latest article on the Amazon Builders’ Library, Yan Cui highlights the main takeaways from the article, Implementing health checks, by AWS Principal Engineer David Yanacek.
The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.
Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.
Control plane = making changes to a system (e.g. adding resources) and propagating those changes
Data plane = the daily business of those resources – what it takes for them to function
Many load balancers use a “least requests” algorithm. An unhealthy server that fails requests quickly can attract more traffic than healthy servers. You can prevent this “black hole” effect by slowing down failed requests to match the average latency of successful requests.
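A minimal sketch of this idea, assuming a hypothetical request handler and a rolling average of successful-request latency (here hard-coded as a constant for illustration):

```python
import time

# Assumed rolling average latency of successful requests, in seconds.
# In a real system this would come from live metrics, not a constant.
AVG_SUCCESS_LATENCY = 0.15

def handle_request(request, process):
    """Run `process`; if it fails fast, pad the failure so it takes
    about as long as a typical success. This stops a "least requests"
    load balancer from funnelling extra traffic to a broken server
    just because it errors out quickly."""
    start = time.monotonic()
    try:
        return process(request)
    except Exception:
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, AVG_SUCCESS_LATENCY - elapsed))
        raise
```

The padding only applies to failures, so healthy servers pay no latency cost.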
Servers can fail independently (corrupt disk, memory leak, etc.) or collectively (e.g. outage to shared dependencies or network issues). Health checks that fail for non-critical reasons can be dangerous – if the non-critical failure is correlated across servers (e.g. shared dependency) then it can kill the entire fleet.
There’s tension between a thorough health check that quickly mitigates single-server failures and the harm of a false positive that affects the whole fleet. In general, automation around health checks should stop traffic to a single bad server but keep serving traffic if the entire fleet appears to be in trouble. YC: this might seem counter-intuitive, but if the whole fleet is unhealthy then you’re in trouble anyway, so you might as well keep going and hope it’s a false positive, right?
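This “fail open” behaviour can be sketched as follows. The function name and the 50% threshold are assumptions for illustration – real load balancers make this configurable:

```python
# Assumed threshold: if more than half the fleet looks unhealthy,
# distrust the health check rather than the servers.
FAIL_OPEN_THRESHOLD = 0.5

def routable_servers(servers, is_healthy):
    """Return the servers a load balancer should route to.

    Individual failures are removed from rotation, but if the failure
    looks fleet-wide (likely a correlated false positive, e.g. a shared
    soft dependency), fail open and keep serving all traffic."""
    healthy = [s for s in servers if is_healthy(s)]
    unhealthy_fraction = 1 - len(healthy) / len(servers)
    if unhealthy_fraction > FAIL_OPEN_THRESHOLD:
        return list(servers)  # whole-fleet trouble: keep everything in
    return healthy
```

One bad server out of four is removed; if all four report unhealthy, all four keep receiving traffic.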
A “soft dependency” is a dependency that you call only sometimes. Without fail-open, a health check that tests the health of soft dependencies turns them into “hard dependencies”. If the dependency is down, the service is down, creating cascading failures.
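A sketch of a health check that avoids this trap. The dependency names (`check_database`, `check_recommendations`) are hypothetical; the point is that a soft dependency being down is surfaced for operators but does not fail the check:

```python
def health_check(check_database, check_recommendations):
    """Hypothetical health check with a hard and a soft dependency.

    Failing on the database (needed by every request) is correct.
    Failing on the recommendations service (needed only sometimes)
    would turn a soft dependency into a hard one and could take the
    whole fleet out of rotation at once."""
    if not check_database():
        return {"healthy": False, "reason": "database unreachable"}
    status = {"healthy": True}
    if not check_recommendations():
        # Report degradation without marking the server unhealthy.
        status["degraded"] = ["recommendations"]
    return status
```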
It’s rarely a clear-cut decision as to which dependencies to health-check, but in general, you should prioritize the availability of data-plane operations. Similarly, you can also use read caches to preserve uptime for read operations even when the datastore is down.
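The read-cache idea can be sketched like this, assuming a hypothetical `datastore_get` callable and a simple in-process dictionary as the cache:

```python
# Hypothetical in-process read cache, refreshed on every successful read.
_read_cache = {}

def read(key, datastore_get):
    """Read from the datastore, falling back to the last cached value
    if the datastore is down. Trades freshness for availability on the
    read side of the data plane."""
    try:
        value = datastore_get(key)
        _read_cache[key] = value
        return value
    except Exception:
        if key in _read_cache:
            return _read_cache[key]  # possibly stale, but still serving
        raise  # nothing cached: the failure has to surface
```

Reads keep working through a datastore outage for any key that has been read before.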
There are a number of real-world failures at Amazon in this section. I recommend reading the whole section even if you have skipped the earlier sections.
Read parts 1-5 of the Amazon Builders’ Library in Focus series: