With AWS Lambda, we get blue-green deployment out of the box. After we update our code, requests against our function are routed to the new version. The platform then automatically disposes of all containers running the old code to free up resources.
This is great, but often it is still not enough. When the traffic is switched over to the new code, any uncaught bugs can impact all users at the same time. This is risky and we often want to limit the blast radius of these uncaught bugs.
Canary deployments help us in these situations. With a canary deployment, the new code is made available to only a small percentage of users first, as our “canary in a coal mine”. We will monitor the health of the new code in terms of performance and error rate. We will route the rest of the users to the new code only when we are satisfied that it is working and performing as we expect.
AWS Lambda has built-in support for canary deployments through weighted aliases and CodeDeploy.
With a weighted alias, you can control and route traffic to two versions of the same function based on your configured weighting.
Just as every version of your function has a unique Amazon Resource Name, or ARN, aliases have unique ARNs too. To make use of a weighted alias, you need to make sure that your event source (e.g. API Gateway) references the ARN of the alias.
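To make the weighting concrete, here is a minimal sketch of the request you could send to Lambda's UpdateAlias API to split traffic between two versions. The function name `order-api`, the alias name `live` and the version numbers are placeholders, not values from this post. The helper only builds the request; the boto3 call is left as a comment because it needs AWS credentials.

```python
def weighted_alias_request(function_name, alias, stable_version,
                           canary_version, canary_weight):
    """Build keyword arguments for Lambda's UpdateAlias API.

    The alias points at `stable_version`, and a `canary_weight` share
    (0.0 to 1.0) of invocations is sent to `canary_version` instead.
    """
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0 and 1")
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": stable_version,
        "RoutingConfig": {
            "AdditionalVersionWeights": {canary_version: canary_weight},
        },
    }


# Placeholder names: "order-api" and "live" are assumptions for illustration.
request = weighted_alias_request("order-api", "live", "1", "2", 0.1)

# With AWS credentials configured, the request could be sent with boto3:
#   import boto3
#   boto3.client("lambda").update_alias(**request)
```

The event source would then reference the alias ARN, which has the form `arn:aws:lambda:<region>:<account-id>:function:<function-name>:<alias>`, rather than the ARN of a specific version.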
During a canary deployment, we need to monitor the system and adjust the traffic routing configuration only when we are satisfied the new code is working properly. If the performance or health of the system degrades, then we need to stop and roll back the change before it impacts any more users. CodeDeploy can automate the entire process for us, and integrates directly with both CloudWatch and Lambda’s weighted aliases.
To enable automatic rollback, you need to configure CloudWatch alarms for the deployment. If any of the alarms are triggered during the deployment, the deployment is stopped and traffic is rolled back to the previous version.
When you deploy with CodeDeploy, you can choose from a number of pre-built configurations. For example, you can route 10% of traffic to the new code first, then route the rest after 5 minutes if no CloudWatch alarms are triggered. This is the classic Canary deployment scenario where traffic is shifted in two increments.
There is another variant of this approach, which CodeDeploy calls Linear deployment, where 10% of traffic is shifted to the new code every 1, 2, 3 or 10 minutes. If the CloudWatch alarms are triggered at any point during this process, then the whole deployment is stopped and rolled back. That is, 100% of the traffic would be going to the old code once the rollback operation is complete.
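The difference between the two styles is easier to see as a timeline. The sketch below is only a model of the intended traffic progression, assuming the first increment takes effect as soon as the deployment starts and that no alarm fires. It is not how CodeDeploy is implemented, and the function names are made up for illustration.

```python
def canary_schedule(first_percent, wait_minutes):
    """Two-step shift: `first_percent` at minute 0, then 100% after the wait."""
    return [(0, first_percent), (wait_minutes, 100)]


def linear_schedule(step_percent, interval_minutes):
    """Equal steps of `step_percent` every `interval_minutes` until 100%."""
    schedule = []
    percent, minute = 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((minute, percent))
        minute += interval_minutes
    return schedule


# Each tuple is (minutes since the deployment started, % of traffic on new code).
print(canary_schedule(10, 5))      # [(0, 10), (5, 100)]
print(linear_schedule(10, 1)[:3])  # [(0, 10), (1, 20), (2, 30)]
```

Under these assumptions, a linear deployment of 10% every minute reaches 100% at minute 9, whereas the canary variant jumps straight from 10% to 100% once the wait is over.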
I think these built-in tools are good enough for most people’s use cases. However, Lambda’s weighted aliases route traffic by request, not by user. That is a subtle, but important difference in at least two ways.
First, you cannot predict how the requests are distributed amongst the users. In the example below, we received a total of 41 requests from 5 concurrent users.
If 10% of those requests all came from the same user, then our blast radius is one out of five users. This is what we hope to achieve with canary deployments – to minimize the blast radius of any uncaught bugs that make their way into production.
But it’s just as likely for those 10% of requests to originate from four different users. In which case our blast radius is now four out of five users, or 80% of the active users. This is clearly unacceptable for systems that have to deal with a large number of active users.
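The two scenarios above can be checked with a few lines of code. This is a small sketch that counts how many distinct users are touched by the requests routed to the new code. The per-user split of the 41 requests is invented for illustration, since only the totals are given.

```python
def blast_radius(request_users, canary_request_indexes):
    """Return (affected_users, total_users) for requests sent to the canary."""
    affected = {request_users[i] for i in canary_request_indexes}
    return len(affected), len(set(request_users))


# 41 requests from 5 users; the split per user is made up for illustration.
requests = ["u1"] * 9 + ["u2"] * 8 + ["u3"] * 8 + ["u4"] * 8 + ["u5"] * 8

# Roughly 10% of 41 requests is 4 requests. Best case: all 4 hit the same user.
print(blast_radius(requests, [0, 1, 2, 3]))    # (1, 5)
# Worst case: the 4 canary requests belong to 4 different users.
print(blast_radius(requests, [0, 9, 17, 25]))  # (4, 5)
```

The same number of canary requests gives a very different blast radius depending on which users happened to send them, and a weighted alias gives you no control over that.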
As our systems scale, the impact on customers and the cost of these uncaught bugs go up as well. Given the frequency of deployments at Netflix, imagine the volume of customer complaints it will receive if every bug has the potential to reach 80% of active users right away. The reputational cost and the burden on their customer support team are just too great.
This is why the traffic routing needs to be done at the user level, not at the level of individual requests.
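One common way to route at the user level is to hash a stable user identifier into a bucket and compare it against the canary percentage. The sketch below illustrates that idea. It is not something weighted aliases do for you, and the salt value and function names are hypothetical.

```python
import hashlib


def user_bucket(user_id, salt="deploy-42"):
    """Map a user ID to a stable bucket in the range 0..99.

    The salt is a hypothetical per-deployment value, so that different
    deployments do not always pick the same users as canaries.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def version_for_user(user_id, canary_percent):
    """Send a user to the new code only if their bucket is below the threshold."""
    return "v2" if user_bucket(user_id) < canary_percent else "v1"
```

Because the bucket depends only on the user ID and the salt, every request from the same user gets the same answer, and raising `canary_percent` only ever moves users from v1 to v2. The routing decision could also be passed along the call chain (for example in the event payload) so that downstream functions honour it instead of deciding again.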
Another consequence of routing by request instead of user is that there is no way to propagate the routing decision along the call chain. This impacts you when multiple functions are involved and chained together through some means. Each function would route traffic between old and new code by request independently.
This opens you up to problems related to compatibility between different versions of your code. Imagine a food ordering system, where the order flow is implemented in an event-driven fashion. There is an API
To implement a new feature, the API function records additional information in the order_placed event. One of the Kinesis functions depends on this information and cannot function without it. All of your tests (unit, integration and acceptance) are executed as part of your pipeline and everything works as expected. Now it’s time to deploy to production.
As part of the deployment, every function in this project is configured with 10% canary deployment over 10 minutes. However, because every function would route traffic between v1 and v2 independently, you now have a problem. v1 of the API function does not record the necessary information in the order_placed event and causes a downstream Kinesis function to fail.
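To get a feel for how often this happens, we can work out the share of orders that hit the incompatible pairing during the canary window. This is a back-of-the-envelope sketch that assumes both functions sit at a 90/10 split and make their routing decisions independently of each other.

```python
from itertools import product

# Weights during the canary window: 90% of invocations run v1, 10% run v2.
weights = {"v1": 0.9, "v2": 0.1}

# Probability of each (API version, Kinesis function version) pairing,
# assuming the two functions route independently.
pairings = {
    (api, consumer): weights[api] * weights[consumer]
    for api, consumer in product(weights, repeat=2)
}

# The failing case from the example: a v1 API event handled by a v2 consumer.
broken = pairings[("v1", "v2")]
print(f"{broken:.0%} of orders hit the incompatible pairing")  # 9% of orders ...
```

Under these assumptions roughly 9% of orders fail, even though both versions pass all of their tests when they are deployed together.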
The number of possible combinations of these two versions is 2^N, where N is the number of functions involved in the chain, because each function independently runs either v1 or v2. But we know from psychology that the average human has a working memory capacity of only 7 ± 2 items. So it goes without saying that debugging this type of issue is going to be very difficult.
When you are using a weighted alias, the CloudWatch metrics are not tracked against the specific version that was used. The metrics would report a dimension for the alias, but not the version. This means we are not able to monitor and isolate problems to the new code. It can lead to false positives triggering unnecessary rollbacks.
Similarly, it can also mask performance issues with the new code. Suppose the new code (v2) is performing poorly compared to the current production code (v1), as determined by their respective 95th or 99th percentile latencies. Because v2 only accounts for 10% of the traffic, its performance woes become less obvious when we look at the overall latency metric for the weighted alias.
If the aforementioned limitations are showstoppers for you, then here are two possible alternatives for you to consider.
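A small worked example shows how strong this masking effect can be. The latencies below are invented for illustration, and the percentile uses the simple nearest-rank method, which may differ slightly from how CloudWatch computes its statistics.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Made-up latencies in milliseconds: v1 serves 90% of requests, v2 serves 10%.
v1 = [100] * 900
v2 = [100] * 60 + [1000] * 40   # 40% of v2 requests are slow

print(percentile(v1, 95))       # 100
print(percentile(v2, 95))       # 1000
print(percentile(v1 + v2, 95))  # 100: the alias-level p95 hides v2's problem
```

Here 40% of the requests served by v2 are ten times slower, yet they make up only 4% of all requests, so the 95th percentile for the alias as a whole does not move at all.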
The simplest alternative is to move the routing to the client. In this setup, you will deploy the new code under a different entrypoint. It might be a different domain altogether (e.g. canary.example.com), or a versioned path (e.g. example.com/v2/my-endpoint).
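As a minimal sketch of the two entrypoint styles mentioned above, the helper below derives the canary URL from the stable one. The `canary.` host prefix and the `/v2` path prefix simply mirror the examples in the text, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def canary_url(stable_url, style):
    """Derive the canary entrypoint for a stable URL.

    style="domain" prefixes the host with "canary.";
    style="path" prefixes the path with "/v2".
    """
    parts = urlsplit(stable_url)
    if style == "domain":
        return urlunsplit(parts._replace(netloc=f"canary.{parts.netloc}"))
    if style == "path":
        return urlunsplit(parts._replace(path=f"/v2{parts.path}"))
    raise ValueError(f"unknown style: {style}")


print(canary_url("https://example.com/my-endpoint", "domain"))
# https://canary.example.com/my-endpoint
print(canary_url("https://example.com/my-endpoint", "path"))
# https://example.com/v2/my-endpoint
```

The client would call the canary URL only for users who have been selected for the new code, and the stable URL for everyone else.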
You then need to give the client application a way to discover:
This might look a lot like how you would set up an A/B test. So instead of implementing it yourself, another option is to piggyback off third-party services that support this workflow.
LaunchDarkly is the best-known service for implementing feature toggles and can be used to support A/B tests. However, application servers would traditionally keep a live socket connection to LaunchDarkly. This is how they discover changes to feature toggle settings from the control plane. Further investigation is needed to see how feasible it is to use LaunchDarkly from AWS Lambda.
In summary, we discussed in this post:
My general feeling is that despite their shortcomings, weighted aliases and CodeDeploy are still good enough for most use cases. They offer a much-needed capability for many organizations and we are certainly much better off than not having them at all. The goal of this post is to help you understand where they fall short so you can plan ahead accordingly as your needs grow. Please let us know in the comments below what other approaches you are aware of and if you would like us to investigate the integration path between AWS Lambda and LaunchDarkly.