Next in our series on the Amazon Builders’ Library, Mike Roberts of Symphonia picks out the key insights from the article Going faster with Continuous Delivery by AWS senior manager of software development Mark Mansour.
The Amazon Builders’ Library is a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.
Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.
I’ve been using Continuous Integration and Continuous Delivery techniques for nearly 20 years. My “go to” tool these days for orchestrating CD is AWS CodePipeline. It was fascinating, therefore, to read Going faster with Continuous Delivery to learn about AWS’ own journey into mass adoption of CD within their engineering organization, which itself has led to the development of the CodePipeline service. Here are some of the elements of this article I thought were most interesting.
Let’s start at the end of the article. Mark says that “Amazon is now at the point where teams aim for full automation as they write new code. For us, automation is the only way we could have continued to grow our business.” So here it is – at an organizational level, CD is key to the success of AWS as a business. This breaks down into different aspects. For the customer, “continuous deployment has a positive impact on quality”, and it “allows teams to release frequently”. For the team, AWS has “seen automation give engineers back time by removing frustrating, error prone, and laborious manual work.”
That’s great news! But how did they get here?
Amazon’s journey with CD started over 10 years ago. They found that on average it took “16 days from code check-in to production”, of which 14 days were “spent waiting for team members to start a build, to perform deployments, and run tests”. Using the Amazon Leadership Principle of “Insist on the Highest Standards” as a basis, they decided to automate these processes “to improve our speed of execution … to eliminate delays while maintaining or even improving quality”. Amazon wasn’t starting from scratch, however: it already had an automated build system named Brazil and a deployment system named Apollo. What it created was Pipelines, to bridge the gap between the two.
Pipelines is an automation system that first builds an artifact, and then runs that artifact through a sequence of steps, each step “increas[ing] [teams’] confidence that the build artifact doesn’t contain defects”.
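To make the idea concrete, here is a minimal sketch of "build once, then run the artifact through a sequence of steps". It is not Amazon's Pipelines tool: the `Step` type, the `run_pipeline` function, and the example step names are all hypothetical, and each check is a stand-in for real tests.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One pipeline stage (hypothetical type, for illustration only)."""
    name: str
    check: Callable[[str], bool]  # returns True if the artifact passes this step


def run_pipeline(artifact: str, steps: List[Step]) -> Optional[str]:
    """Run the same artifact through each step in order.

    Returns the name of the first failing step, or None if every step passed.
    """
    for step in steps:
        if not step.check(artifact):
            # Stop here: later steps never see an artifact that already failed.
            return step.name
    return None


# Stand-in checks; a real pipeline would run actual test suites here.
steps = [
    Step("unit-tests", lambda artifact: True),
    Step("integration-tests", lambda artifact: True),
    Step("pre-production", lambda artifact: artifact.endswith(".zip")),
]
```

Each step that passes adds a little more confidence in the artifact, and a failure at any step stops the artifact from going further.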
Here are the different techniques they use for different steps:
Amazon automates “unit, integration, and pre-production testing” within a pipeline. This covers a wide range of test types, and also includes code coverage, code complexity, load tests and security tests, all in the name of verifying that the build artifact created at the beginning of the pipeline is “functionally correct”. Amazon uses tests that are “smaller and faster” so that teams “receive feedback quicker on any problems”.
A final testing technique before releasing a change to customers is “pre-production testing”. A pre-production environment has precisely the same configuration as a regular production environment, but it only receives traffic from the team that owns the service. This technique “makes sure that the service can correctly connect to all production resources” and it “ensures that the system interacts correctly with the APIs of the production services it depends on.”
The article describes the trickiness of the question “how much testing is enough?”. On one hand, teams want to minimize the impact of problems on customers. On the other, if teams “over invest in testing, then we might not succeed because others have moved faster than us.” Overall though, “Software teams at Amazon have a high bar for testing, and spend a lot of effort on it”.
Once a change is deployed, Amazon “run[s] quick checks that ensure the newly deployed artifact has started and is serving traffic”. For systems using CodeDeploy, they may “use lifecycle event hooks … trigger simple scripts to stop, start, and validate the deployment”. At this stage, if a failure is detected, the pipeline “should roll back the change to minimize the time that customers see a defect”.
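The deploy, validate, roll back sequence can be sketched as below. This is an illustration under assumptions, not how CodeDeploy or Pipelines implement it: `deploy_with_rollback` and its `deploy` and `is_healthy` parameters are hypothetical names, with the caller supplying whatever actually deploys a version and checks that it is serving traffic.

```python
from typing import Callable


def deploy_with_rollback(
    new_version: str,
    current_version: str,
    deploy: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> str:
    """Deploy new_version, run a quick health check, and roll back on failure.

    Returns the version that is live when the function finishes.
    """
    deploy(new_version)
    if is_healthy():
        return new_version
    # The quick check failed: redeploy the previous version straight away
    # so customers see the defect for as short a time as possible.
    deploy(current_version)
    return current_version
```

Passing the deploy and health-check functions in as arguments keeps the rollback decision separate from any particular deployment tool.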
Pipelines can be blocked from automatically releasing for various reasons:
If necessary, teams can override some or all of these blocks, e.g. if a release would actually solve the problem being reported.
Amazon gradually releases changes to customers, rather than to all customers at once. They do this by means of cells – each cell is “a completely independent instance of a service”. Essentially, cell-based release is an extension of a “canary” release, where multiple canary stages are used, each with an incrementally higher number of customers. (I’ve also seen this technique referred to by other organizations as ‘deployment rings’.) After releasing to a cell, Amazon waits for a set count of data points and/or a specific period of time, and then decides whether to release to the next cell by looking at error rates. Whether this process takes minutes or hours depends very much on each team.
An important nuance with this technique is that Amazon aims “to get from check-in to our first production customer as quickly as possible. However, after we’re in production we slowly release the new code to customers, working to gain additional confidence”.
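A cell-by-cell release gated on error rates might look like the following sketch. The function name, the `max_error_rate` threshold of 1%, and the cell names in the usage note are assumptions made for illustration; the article does not give Amazon's actual thresholds or wait times.

```python
from typing import Callable, List


def release_to_cells(
    cells: List[str],
    deploy: Callable[[str], None],
    error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,  # assumed threshold, not from the article
) -> List[str]:
    """Deploy to one cell at a time, stopping at the first unhealthy cell.

    error_rate(cell) is assumed to wait until enough data points or enough
    time has accumulated before returning the observed error rate.
    Returns the cells where the release was accepted.
    """
    accepted: List[str] = []
    for cell in cells:
        deploy(cell)
        if error_rate(cell) > max_error_rate:
            # Halt the rollout; the cells after this one are never touched.
            break
        accepted.append(cell)
    return accepted
```

Ordering `cells` from the smallest customer population to the largest, for example `["cell-1", "cell-2", "cell-3"]`, gives the "fast to the first customer, slow to the rest" behavior described above. Rolling back the failing cell is left to the caller in this sketch.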
Even once a change is fully released, Amazon “generate[s] synthetic traffic on our systems” every minute against “all public-facing APIs”, “to make sure that our production systems are continuing to serve customer requests”. (Amazon recently released CloudWatch Synthetics, which allows you to set up the same kind of system more easily.)
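One round of synthetic checks could be sketched as follows. This is not the CloudWatch Synthetics API: `probe_endpoints` and its `call` parameter are hypothetical, with `call` standing in for whatever HTTP client makes the request and returns a status code.

```python
from typing import Callable, Dict, List


def probe_endpoints(
    endpoints: List[str],
    call: Callable[[str], int],
) -> Dict[str, bool]:
    """Call every endpoint once and record whether it answered with a 2xx status.

    An endpoint that raises (timeout, connection refused, ...) counts as failing.
    """
    results: Dict[str, bool] = {}
    for endpoint in endpoints:
        try:
            results[endpoint] = 200 <= call(endpoint) < 300
        except Exception:
            results[endpoint] = False
    return results
```

A scheduler would run a function like this once a minute and raise an alarm when any value in the result is `False`.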
These techniques are great, but they were not adopted overnight: they rolled out across Amazon’s teams over a period of time. There has been a process of learning what practices work best organizationally, and for each team. Sometimes these lessons were spread by knowledge sharing, other times by actually encoding the practice, such as certain validations, within the Pipelines tool itself.
While each Pipeline itself represents improvement for the stability of a service, and the speed at which it’s released, the use of Pipelines also drove teams to consolidate their variety of release processes (e.g. for bug fixes versus major features) into one, standardized, process that everyone used. This drove “consistency, standardization, and simplification of a team’s release processes” and subsequently “fewer defects”.
Overall, though, the use of CD has been and continues to be an iterative process throughout AWS. Some teams started with a pipeline being more of “a visual interface to their release process without automatically promoting build artifacts” but over time “they gradually turned on automation at different stages of their pipeline until they no longer needed to manually trigger any step of their pipeline.” In other words, full Continuous Deployment and not just Continuous Delivery.
I’ve heard many folks say that Continuous Deployment is a technique only useful for startups. With this article Mark Mansour shows that with the right techniques, tools, culture, and a goal of iterative improvement, even an organization the size of Amazon can sustainably embrace Continuous Deployment to grow their business.
Mike Roberts is a partner and co-founder of Symphonia. Symphonia is an AWS Cloud Technology Consultancy based in New York City, specializing in Serverless, DevOps, and Data Architecture & Engineering. Find out more here, and see Mike’s twitter at @mikebroberts.