Next in our series on the Amazon Builders’ Library, Yan Cui picks out the key insights from the article, Static stability using availability zones, by AWS Senior Principal Engineer Becky Weiss and AWS Principal Engineer Mike Furr.
The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.
Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.
Control plane = changes to a system (e.g. adding resources) and propagating the changes.
Data plane = daily business of those resources – what it takes for them to function.
Separate the data plane and control plane, because:
The data plane usually receives data from the control plane but maintains its own state so it can continue working even when the control plane is impaired.
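To make the separation concrete, here is a minimal sketch of a data plane component that keeps its last known good state when the control plane cannot be reached. The `DataPlaneRouter` class and the `fetch_config` callable are hypothetical names made up for illustration; they are not from the original article or any AWS API.

```python
from typing import Callable, Dict


class DataPlaneRouter:
    """Hypothetical data plane component that serves from locally held state."""

    def __init__(self, fetch_config: Callable[[], Dict[str, str]]):
        # fetch_config stands in for a call to the control plane (assumption).
        self._fetch_config = fetch_config
        self._routes: Dict[str, str] = {}

    def refresh(self) -> bool:
        """Try to pull new state; keep the old state if the control plane fails."""
        try:
            new_routes = self._fetch_config()
        except Exception:
            # Control plane is impaired: keep serving the last known good state.
            return False
        self._routes = dict(new_routes)
        return True

    def route(self, key: str) -> str:
        # Lookups only touch local state, never the control plane.
        return self._routes[key]
```

Once `refresh()` has succeeded at least once, `route()` keeps answering from the local copy even if every later refresh fails.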
One lesson Amazon learned is to expect impairments before they happen. A statically stable service would continue to function in the face of partial impairment (e.g. losing an AZ) or impairment to its dependencies.
Reacting to impairments as they happen (e.g. if one AZ fails other AZs would scale up to take over the load) is less effective because the response to impairment requires actions from the control plane. Control planes are typically more complex and more likely to misbehave when the overall system is impaired. A statically stable service would over-provision to the point where it doesn’t need to launch any EC2 instances even if one AZ is impaired.
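The over-provisioning arithmetic can be sketched as follows. This is a simplified model, assuming load is spread evenly across AZs and capacity is measured in whole instances; the function names are made up for illustration.

```python
import math


def instances_per_az(peak_instances: int, az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in every AZ so the surviving AZs can carry peak load."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    # Each surviving AZ must absorb an equal share of the full peak load.
    return math.ceil(peak_instances / surviving)


def fleet_size(peak_instances: int, az_count: int,
               az_failures_tolerated: int = 1) -> int:
    """Total instances provisioned up front across all AZs."""
    return instances_per_az(peak_instances, az_count,
                            az_failures_tolerated) * az_count


# Peak load needs 12 instances, spread over 3 AZs, tolerating the loss of 1 AZ.
print(instances_per_az(12, 3))  # 6 per AZ
print(fleet_size(12, 3))        # 18 in total, i.e. 50% over-provisioned
```

With two AZs the same peak load needs 12 instances per AZ (24 in total), which is one reason spreading across more AZs reduces the cost of being statically stable.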
The rest of the article then goes deeper into how static stability is applied in EC2:
You can use the aforementioned active-active pattern to build highly available regional services. You can then stack these services on top of each other. This regional-calls-regional pattern is one Amazon uses for many of its services – both external-facing as well as internal.
But for foundational services – services that are building blocks for other services such as EC2 – Amazon designs them to be AZ independent instead.
This is why EC2 NAT Gateway is a zonal resource. AZ independence is important here because NAT Gateway sits in the path of internet connectivity and is, therefore, part of the data plane for any EC2 instance in the VPC.
To allow customers to build highly available regional services, Amazon needs to ensure AZ impairments are contained and do not spread to other AZs. This is why all foundational components such as NAT Gateway need to stay within an AZ.
The tradeoff for this design decision is the additional complexity involved in managing zonal (rather than regional) service configurations, for example multiple NAT gateways and routing tables.
Amazon also periodically stores database backups in S3 and keeps read replicas across multiple AZs. This is to ensure they store customer or business-critical data durably.
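As a rough illustration of that tradeoff, the sketch below models one route per AZ, each pointing at a NAT gateway in the same AZ, and compares it with a single shared NAT gateway. It is a toy model rather than real AWS configuration: the `Route` type, the function names and the AZ names are assumptions made for the example.

```python
from typing import List, NamedTuple, Set


class Route(NamedTuple):
    subnet_az: str        # AZ of the private subnet that owns the route table
    destination: str      # CIDR the route matches
    nat_gateway_az: str   # AZ of the NAT gateway the traffic is sent to


def zonal_route_plan(azs: List[str]) -> List[Route]:
    """One route table per AZ, each using the NAT gateway in its own AZ."""
    return [Route(az, "0.0.0.0/0", az) for az in azs]


def shared_route_plan(azs: List[str], nat_az: str) -> List[Route]:
    """Every AZ sends internet-bound traffic through one NAT gateway."""
    return [Route(az, "0.0.0.0/0", nat_az) for az in azs]


def affected_azs(plan: List[Route], failed_az: str) -> Set[str]:
    """AZs whose internet path is lost when failed_az is impaired."""
    return {r.subnet_az for r in plan
            if failed_az in (r.subnet_az, r.nat_gateway_az)}


azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
print(affected_azs(zonal_route_plan(azs), "us-east-1a"))
print(affected_azs(shared_route_plan(azs, "us-east-1a"), "us-east-1a"))
```

The zonal plan needs three NAT gateways and three route tables instead of one of each, but losing one AZ only affects that AZ; with the shared plan, losing the NAT gateway's AZ cuts off all three.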
Read parts 1-4 of the Amazon Builders’ Library in Focus series: