When we invoke a Lambda function, the AWS Lambda service creates an instance of the function inside an execution environment. It runs the handler method implemented by the function to process the event that triggered it. Once the process is complete, the execution environment does not shut down immediately. It stays warm, waiting for further events; if no activity occurs for a while (on the order of minutes, reportedly up to around 30, though AWS does not guarantee an exact duration), Lambda shuts the idle instance down.
However, real-world applications receive many requests within milliseconds of each other, so there will be situations where one instance is serving an event while other invocations arrive in parallel. In that case, Lambda initializes another instance to handle the additional requests. As more events come in, Lambda initializes more instances and routes requests based on instance availability. When request volume decreases, Lambda scales down by stopping unused instances.
In Lambda, the number of instances serving requests at a given time is known as concurrency. However, the bursting of these instances cannot be infinite. It starts with an initial burst of between 500 and 3,000 instances, depending on the Region where the Lambda function runs.
Burst concurrency limits
After the initial burst, Lambda can scale further by 500 instances per minute until there are enough instances to serve all requests or the maximum concurrency limit is reached. When incoming throughput exceeds what the available instances can absorb, requests start failing with a throttling error (HTTP 429).
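The scaling behavior described above can be sketched as a small model. This is illustrative arithmetic, not an AWS API; the function names are made up, and the initial burst (500–3,000) and 500-per-minute ramp are the published defaults.

```python
# Illustrative model (not an AWS API) of Lambda's burst scaling:
# an initial burst (500-3000 depending on Region), then +500 instances
# per minute until demand is met or the account limit is reached.

def available_concurrency(minutes_elapsed, demand, initial_burst=3000,
                          ramp_per_minute=500, account_limit=10000):
    """Instances Lambda can have running after `minutes_elapsed` minutes."""
    scaled = initial_burst + ramp_per_minute * minutes_elapsed
    return min(scaled, demand, account_limit)

def throttled_requests(minutes_elapsed, demand, **kwargs):
    """Concurrent requests that would fail with HTTP 429 at that moment."""
    return max(0, demand - available_concurrency(minutes_elapsed, demand, **kwargs))
```

For example, 4,500 concurrent requests arriving in a Region with a 3,000-instance burst limit would see 1,500 throttled at first, and none after the ramp catches up a few minutes later.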
Concurrency limits are defined at two levels:
Account – Defaults to 1,000 concurrent executions per account per Region. It can be raised by requesting a limit increase from AWS.
Function – Configured per function (reserved concurrency). If not defined, the function draws from the account-level unreserved pool. You can reserve up to 900 units of concurrency across your functions, because the remaining 100 are always kept unreserved for functions that don't define their own limit. It is recommended to set a function-level limit so that one function's unreasonable scaling doesn't starve the other functions in the same account.
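The account-level arithmetic behind these two limits can be sketched as follows. This is an illustrative check, not an AWS API; the function names and reservation amounts are hypothetical.

```python
# Illustrative model (not an AWS API) of the account concurrency pool:
# with the default limit of 1000, at most 900 can be reserved, because
# Lambda always keeps 100 unreserved for functions with no reserved
# concurrency setting of their own.

ACCOUNT_LIMIT = 1000
MIN_UNRESERVED = 100

def unreserved_pool(reservations):
    """Concurrency left for functions without a reserved limit."""
    return ACCOUNT_LIMIT - sum(reservations.values())

def can_reserve(reservations, function_name, amount):
    """A new reservation is allowed only if >= 100 stays unreserved."""
    proposed = dict(reservations, **{function_name: amount})
    return unreserved_pool(proposed) >= MIN_UNRESERVED

reservations = {"checkout": 400, "payments": 300}
```

With 700 already reserved, a further 200 can still be reserved (bringing the total to the 900 maximum), but a further 300 cannot.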
Lambda concurrency is handled automatically by AWS, so developers don't have to manage it themselves. However, we need to understand how it works so that we can configure it properly and avoid the problems it can introduce.
Let’s first talk about what problems it creates, namely cold starts:
When we invoke a Lambda function for the first time, the service downloads the code package from S3 along with its dependencies, creates a container, and starts the runtime before it executes the handler. This whole duration (excluding the execution of the handler itself) is known as the cold start time.
Now, consider a scenario where we receive a high volume of requests in the morning: Lambda keeps spinning up new instances to serve them all. Each new instance takes time to initialize (from well under a second to several seconds, and historically far longer for VPC-attached functions), which adds latency to those requests' responses. Requests may even time out, since API Gateway has a 29-second maximum integration timeout. A sudden spike of requests in a short period can therefore significantly hurt the user experience.
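The timeout interaction is simple arithmetic, but worth making explicit. A minimal sketch, assuming a request served by a cold instance pays the full initialization time on top of the handler's execution time:

```python
# Illustrative arithmetic: a request served by a cold instance pays
# initialization time on top of handler time. API Gateway enforces a
# 29-second maximum integration timeout, so a long cold start can cause
# a gateway timeout even when the handler itself is fast.

API_GATEWAY_TIMEOUT_S = 29.0

def times_out(cold_start_s, handler_s):
    """True if the end-to-end latency would breach API Gateway's limit."""
    return cold_start_s + handler_s > API_GATEWAY_TIMEOUT_S
```

A 2-second cold start plus a 1-second handler is fine; a 28-second cold start (the kind seen with old-style VPC ENI creation) followed by even a fast handler is not.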
Adding to the above scenario, suppose the Lambda function is configured to connect to VPC resources. Before November 2019, its scaling depended on the number of ENIs available in the VPC, because every execution environment needed an ENI created on the fly to bridge your own VPC and the Lambda service's VPC.
Also, creating the ENI while the invocation was happening added to the cold start latency and hurt the overall response time of a request. So, invoking Lambda in a VPC used to be a very costly affair and was not recommended unless absolutely required.
However, AWS has since introduced several solutions to reduce cold starts and latency: provisioned concurrency, improved VPC network performance for Lambda functions, and Application Auto Scaling for Lambda functions.
As discussed in this article, there will be scenarios where the number of requests exceeds what a single instance of a Lambda function can handle, requiring Lambda to spin up additional instances. However, spinning up additional instances at request time adds cold start latency.
Using provisioned concurrency, we can define how many instances should be kept initialized and warm ahead of time, ensuring there are configured instances ready to serve requests. Provisioned concurrency is applied to an alias or a published version of the function.
Once all the provisioned instances are fully in use and more requests still need to be served, Lambda spins up new on-demand instances, subject to the reserved concurrency defined at the function level and the account-level limit.
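The spillover behavior can be sketched as a small dispatch model. This is illustrative, not an AWS API; the default of 5 provisioned instances and a 100-unit function limit are example numbers.

```python
# Illustrative model (not an AWS API): concurrent requests are served
# first by provisioned (pre-warmed) instances; overflow spills over to
# on-demand instances, which pay a cold start, up to the function's
# concurrency limit; anything beyond that is throttled (HTTP 429).

def dispatch(concurrent_requests, provisioned=5, function_limit=100):
    warm = min(concurrent_requests, provisioned)
    cold = min(max(0, concurrent_requests - provisioned),
               function_limit - provisioned)
    throttled = max(0, concurrent_requests - function_limit)
    return {"warm": warm, "cold_start": cold, "throttled": throttled}
```

With 5 provisioned instances and a limit of 100: 3 concurrent requests are all served warm, 40 requests mean 35 cold starts, and 120 requests mean 20 throttled.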
By default, the Lambda service is not connected to your own VPC's resources. It runs in a separate VPC managed by AWS, from which it can access any resource available over the internet. However, it cannot reach the private resources of your VPC, such as an RDS database or EC2 instances in a private subnet.
To let a function reach these private VPC resources, Lambda creates an ENI (Elastic Network Interface) and performs a cross-account attachment, giving Lambda network access to them. Under the original design, an ENI was required per concurrent execution environment, so the total number of ENIs needed depended on the function's configuration and its concurrency, and every VPC has a limit on how many ENIs can be created.
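AWS's pre-2019 VPC guidance gave a rough formula for how many ENIs a function would need: projected peak concurrency multiplied by the function's memory in GB divided by 3 GB. A minimal sketch of that estimate (hedged: the exact published formula may have varied slightly):

```python
import math

# Rough ENI capacity estimate from AWS's pre-2019 VPC guidance:
# required ENIs ~= peak concurrency * (memory in GB / 3 GB).
# Treat this as an approximation of the old sizing rule, not a
# current AWS API or formula.

def projected_enis(peak_concurrency, memory_gb):
    """Approximate ENIs a VPC-attached function needed pre-Hyperplane."""
    return math.ceil(peak_concurrency * memory_gb / 3.0)
```

For example, 1,000 concurrent executions of a 1.5 GB function would have needed roughly 500 ENIs, enough to exhaust a typical VPC's ENI quota on its own.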
This design created two major issues: creating ENIs at invocation time added significant cold start latency, and the VPC's ENI limit put a hard ceiling on how far the function could scale.
To address these issues, AWS launched Hyperplane, which provides NAT capabilities from the Lambda VPC to customer VPCs. Instead of mapping network interfaces in your VPC directly to individual Lambda execution environments, they are mapped to a shared Hyperplane ENI, and the functions connect through it.
This solution brings a few benefits: the ENI is shared across execution environments, so far fewer ENIs are needed, and it is created when the function is created (or its VPC configuration is updated) rather than at invocation time.
Remember that because Hyperplane creates the network interface when the Lambda function is created, function creation may take up to 90 seconds, and invocations that arrive while the interface is still being created may see delayed responses.
AWS Lambda also integrates with Application Auto Scaling, a web service that enables automatic scaling of AWS resources. You can configure Application Auto Scaling to manage the provisioned concurrency of a Lambda function.
There are two methods of scaling: schedule-based and utilization-based. If you have a use case in which you can anticipate the peak traffic, use schedule-based scaling. Otherwise, use utilization-based scaling. To increase provisioned concurrency based on the need at runtime, you can use the Application Auto Scaling API to register a target and create a scaling policy.
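To make the register-target-and-policy step concrete, here is a sketch of the request shapes for boto3's `application-autoscaling` client (`register_scalable_target` and `put_scaling_policy`). The function name and alias are hypothetical; the parameters are built as plain dicts so the shape is visible without AWS access, and the target-tracking policy uses the predefined `LambdaProvisionedConcurrencyUtilization` metric.

```python
# Builds request parameters for the Application Auto Scaling API, in the
# shape accepted by boto3's "application-autoscaling" client. The
# function name and alias below are illustrative placeholders.

def scaling_requests(function_name, alias, min_cap, max_cap, target=0.7):
    # Lambda scalable targets are addressed as "function:<name>:<alias>".
    resource_id = f"function:{function_name}:{alias}"
    register = {
        "ServiceNamespace": "lambda",
        "ResourceId": resource_id,
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "MinCapacity": min_cap,
        "MaxCapacity": max_cap,
    }
    policy = {
        "PolicyName": f"{function_name}-pc-utilization",
        "ServiceNamespace": "lambda",
        "ResourceId": resource_id,
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
            },
        },
    }
    return register, policy
```

In practice you would pass these to `client.register_scalable_target(**register)` and `client.put_scaling_policy(**policy)` on a boto3 `application-autoscaling` client, after which provisioned concurrency scales between the min and max to hold utilization near the target.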
With a recent upgrade, AWS Lambda now supports measuring concurrent executions per function; earlier, this was only available in aggregate across all functions in an account. It also supports collecting metrics for all versions and aliases of a function. Some of the main CloudWatch metrics related to concurrency are ConcurrentExecutions, UnreservedConcurrentExecutions, ProvisionedConcurrentExecutions, ProvisionedConcurrencyUtilization, and ProvisionedConcurrencySpilloverInvocations.
Application scaling is an important feature for any cloud application, and for serverless it is a must. AWS Lambda has evolved over time, adding many features to support scaling and concurrency and to reduce cold start time, and it has greatly improved the networking between the Lambda VPC and customer VPCs.