How to Monitor Lambda with CloudWatch Metrics

With AWS Lambda, you have basic serverless monitoring built into the platform with CloudWatch. CloudWatch is an AWS monitoring platform that offers support for both metrics and logging, as well as the CloudWatch API which provides programmatic access to metrics. CloudWatch monitoring gives you basic metrics, visualization, and alerting while CloudWatch Logs captures everything that is written to stdout and stderr. In this post, we will take a deep dive into CloudWatch Metrics to see how you can use them for AWS Lambda monitoring, and the limitations of built-in AWS performance monitoring tools.

This is part of a series of articles about AWS Lambda monitoring.

Key AWS Lambda Metrics to Monitor with CloudWatch

Let’s review the key Lambda metrics you can track using AWS monitoring tools.

AWS Lambda Errors

The Errors metric in Lambda counts two types of issues:

Uncaught exceptions thrown by your code
Runtime errors, including invalid type, API timeout, or division by zero

CloudWatch gives you the number of function calls that resulted in an error. To fix the errors, you’ll need to check Lambda logs to diagnose the problem.
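The error count can also be read programmatically through the CloudWatch API. The following is a minimal sketch that only builds the request for the `GetMetricData` operation. The function name `checkout-handler`, the three-hour window and the five-minute period are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def build_error_count_query(function_name, hours=3, period_seconds=300):
    """Build keyword arguments for CloudWatch's GetMetricData API.

    The query asks for the sum of the AWS/Lambda Errors metric for one
    function, in buckets of `period_seconds`, over the last `hours` hours.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return {
        "MetricDataQueries": [
            {
                "Id": "errors",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Lambda",
                        "MetricName": "Errors",
                        "Dimensions": [
                            {"Name": "FunctionName", "Value": function_name}
                        ],
                    },
                    "Period": period_seconds,
                    "Stat": "Sum",
                },
                "ReturnData": True,
            }
        ],
        "StartTime": start,
        "EndTime": end,
    }


# "checkout-handler" is a hypothetical function name.
query = build_error_count_query("checkout-handler")
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").get_metric_data(**query)
```

The count only tells you that something failed, so the logs are still where the diagnosis happens.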

AWS Lambda Dead-Letter Errors

Services like S3 or SNS invoke Lambda asynchronously. If an event still fails after Lambda’s automatic retries, it is sent to a “dead letter queue” (DLQ), provided one is configured for the function. A dead letter error signifies that there was an issue sending the event to that dead letter queue. These errors are critical because they can lead to data loss. They can be caused by incorrect permissions, incorrect resource configuration, or size limits.

AWS Lambda Function Duration

Function duration – the time each Lambda function takes to run – is an important metric because it can affect many aspects of your serverless application:

Long-running functions needlessly increase costs, because functions are billed by actual running time
Long function duration may indicate performance issues or high latency for users
If functions exceed the configured timeout, they will terminate, possibly disrupting service to users

If a function actually times out, CloudWatch doesn’t log the function duration, so it’s important to identify functions whose duration is close to the timeout. Any of the following conditions can increase the duration of a function:

Recent code changes reduced the efficiency of the function
A dependency referenced by the function is slow to respond
An algorithm error or a request for the wrong data
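One way to spot functions that run close to their timeout is to compare each function's maximum duration with its configured timeout. The sketch below assumes you have already collected both values. The sample functions and the 80% warning threshold are hypothetical.

```python
def timeout_headroom(max_duration_ms, timeout_seconds, warn_ratio=0.8):
    """Return (ratio, is_close) for a function's max duration vs its timeout.

    `warn_ratio` is an assumed threshold: flag functions whose slowest
    invocation used at least 80% of the configured timeout.
    """
    timeout_ms = timeout_seconds * 1000
    ratio = max_duration_ms / timeout_ms
    return ratio, ratio >= warn_ratio


# Hypothetical sample data: function name -> (max Duration in ms, timeout in s)
functions = {
    "resize-image": (5200.0, 6),
    "send-email": (850.0, 30),
}

for name, (max_ms, timeout_s) in functions.items():
    ratio, is_close = timeout_headroom(max_ms, timeout_s)
    status = "CLOSE TO TIMEOUT" if is_close else "ok"
    print(f"{name}: {ratio:.0%} of timeout ({status})")
```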

AWS Lambda Function Timeout

In AWS Lambda, timeouts are a severe issue that can hurt the serverless application’s performance. When function duration reaches the configured timeout, the function immediately stops running. CloudWatch does not report timeouts as a separate metric; they are reported together with other generic errors. To identify timeouts, you’ll need to create a custom alert in CloudWatch or use a serverless monitoring tool like Lumigo. If you experience timeouts, use the following process to resolve the issue:

Increase the timeout to a value that will not cause your functions to terminate (preferably lower than the maximum of 900 seconds)
Find the root cause of slow functions and fix it
Ensure functions are running at the appropriate duration, and restore the timeout to a lower value
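To make timeouts visible as their own metric, one option is a CloudWatch Logs metric filter that matches the message Lambda writes when a function times out ("Task timed out after ..."). The sketch below only builds the arguments for the `PutMetricFilter` operation. The `Custom/Lambda` namespace, the filter and metric naming scheme, and the function name are assumptions.

```python
def build_timeout_metric_filter(function_name, namespace="Custom/Lambda"):
    """Build keyword arguments for CloudWatch Logs' PutMetricFilter API.

    Each log line containing the timeout message adds 1 to a custom metric.
    The namespace and the naming scheme are assumptions for illustration.
    """
    return {
        # Lambda writes function logs to /aws/lambda/<function name>.
        "logGroupName": f"/aws/lambda/{function_name}",
        "filterName": f"{function_name}-timeouts",
        # A quoted phrase matches log events that contain it verbatim.
        "filterPattern": '"Task timed out after"',
        "metricTransformations": [
            {
                "metricName": f"{function_name}-timeouts",
                "metricNamespace": namespace,
                "metricValue": "1",
                # Emit 0 when nothing matches, so alarms see a full series.
                "defaultValue": 0.0,
            }
        ],
    }


# "checkout-handler" is a hypothetical function name.
params = build_timeout_metric_filter("checkout-handler")
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("logs").put_metric_filter(**params)
```

A CloudWatch alarm can then be set on the resulting custom metric.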

AWS Lambda Function Invocations

Each time a function is invoked, CloudWatch records that an invocation occurred, whether the invocation succeeds or ends in a function error. This metric does not count throttled invocations. The significance of this metric is that it is used by AWS in billing calculations. A major increase in the number of invocations can significantly increase your AWS Lambda costs. If you notice a spike in the number of invocations, check the following:

There may be an issue with triggers that invoke your serverless function – for example, too many events on services like Kinesis or SQS.
Functions may be failing, causing retries which result in additional invocations for the same event.

AWS Lambda Iterator Age

This metric is primarily used for streaming data. Iterator Age is defined as the time it takes for the latest record in a stream to reach Lambda and be processed by a function. If the metric increases, it means that a backlog of events is building up, and the serverless application is not keeping up. To resolve an iterator age issue, consider increasing concurrency, or making sure events are not too large or complex for your functions to handle in a timely manner.

CloudWatch Metrics Limitations

A number of valuable metrics are sadly missing from CloudWatch monitoring, including:

Concurrent Executions: CloudWatch does report this metric, but only for functions with reserved concurrency. However, it’s a useful metric to have for all functions.
Cold Start Count
Memory Usage and Billed Duration: Lambda reports these in CloudWatch Logs, at the end of every invocation. But they are not available as metrics. You can, however, turn them into custom metrics using metric filters.
Timeout Count: timeouts are a special type of systematic error that should be recorded as a separate metric. So often I have seen teams waste valuable time searching for error messages in the logs, only to realize that there wasn’t any because their function had timed out. Instead, you should log these timeout events and use metric filters to record them as a custom metric.
Estimated Cost: another useful metric to have would be the estimated cost of a function. This can help you make informed decisions on which functions to optimize. For example, it makes no sense to optimize a function whose net spend per month is $10. The effort and cost of optimizing the function would far outweigh any potential savings.
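As a rough illustration of how memory usage, billed duration and estimated cost could be derived from the REPORT line that Lambda writes at the end of every invocation, here is a minimal sketch. The prices are assumptions (x86 pricing without the free tier), so check the current AWS pricing for your region. The sample log line is illustrative only.

```python
import re

# Assumed prices (x86, no free tier); verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000

REPORT_PATTERN = re.compile(
    r"Billed Duration: (?P<billed_ms>\d+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<used_mb>\d+) MB"
)


def parse_report_line(line):
    """Extract billed duration and memory figures from a Lambda REPORT line."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def invocation_cost(billed_ms, memory_mb):
    """Estimate the cost of one invocation in USD under the assumed prices."""
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST


# Illustrative REPORT line; the fields are tab-separated in real logs.
sample = (
    "REPORT RequestId: 3604209a-e9a3-11e6-939a-754dd98c7be3\t"
    "Duration: 12.34 ms\tBilled Duration: 100 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 18 MB"
)
report = parse_report_line(sample)
cost = invocation_cost(report["billed_ms"], report["memory_mb"])
print(report, f"${cost:.10f}")
```

Summing this estimate over a month of invocations gives a per-function figure you can compare against the effort of optimizing it.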

Another problem with CloudWatch Metrics is that its percentile metrics for Lambda don’t work consistently. When it comes to monitoring latencies, you should be using percentiles instead of the average. However, when a function experiences more than ~100 invocations per minute, the percentile latencies stop working! This is a critical issue that we have raised with AWS, and hopefully, it will be addressed in the near future. In the meantime, you can fall back to using a combination of average and max duration. For APIs, you can also use API Gateway’s Latency and IntegrationLatency metrics instead.

Update 19/01/2020: the issue with Lambda’s percentile latency metrics has been fixed as of this release on Nov 26, 2019.

Learn more about Lambda monitoring in our guide: Lambda Logs: a Complete Guide

CloudWatch Dashboards

You can also set up dashboards in CloudWatch at a cost of $3 per month per dashboard (first 3 are free). CloudWatch supports a variety of widget types, and you can even include query results from CloudWatch Logs Insights.

You can compose your dashboards with any metrics from CloudWatch (including custom metrics). For example, the following dashboard is composed of several API Gateway metrics and highlights the health and performance of an API.

You can also use Metric Math to create computed metrics and include them in your dashboards. For example, the Status Codes widget below uses Metric Math to calculate the number of 2XX responses which is not available as a metric.
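To give a feel for what such a widget looks like in the dashboard source, here is a minimal sketch that builds a dashboard body with one Metric Math expression. It assumes the 2XX count is approximated as the total request count minus the 4XX and 5XX errors, which may not be exactly how the original widget was defined. The API name, region and widget size are hypothetical.

```python
import json


def status_codes_widget(api_name, region):
    """Build a dashboard widget that derives a 2XX count with Metric Math."""
    dimension = ["ApiName", api_name]
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": "Status Codes",
            "view": "timeSeries",
            "region": region,
            "stat": "Sum",
            "period": 60,
            "metrics": [
                # Raw metrics; "count" is hidden and only feeds the expression.
                ["AWS/ApiGateway", "Count", *dimension,
                 {"id": "count", "visible": False}],
                ["AWS/ApiGateway", "4XXError", *dimension,
                 {"id": "err4xx", "label": "4XX"}],
                ["AWS/ApiGateway", "5XXError", *dimension,
                 {"id": "err5xx", "label": "5XX"}],
                # Metric Math: treat whatever is left after errors as 2XX.
                [{"expression": "count - err4xx - err5xx",
                  "id": "ok2xx", "label": "2XX"}],
            ],
        },
    }


# "orders-api" and "us-east-1" are hypothetical values.
dashboard_body = json.dumps(
    {"widgets": [status_codes_widget("orders-api", "us-east-1")]}
)
# With boto3 installed and credentials configured, this could be published as:
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="orders-api", DashboardBody=dashboard_body)
```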

Once you have handcrafted your dashboard, you can click Actions, then View/edit source, to see the source code behind the dashboard.

You can then codify the dashboard as an AWS::CloudWatch::Dashboard resource in a CloudFormation template. You will have to parameterize some of the fields such as API name and region so that the template can be used for different stages and regions.
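A parameterized template along these lines could be generated as follows. This is a minimal sketch, not a complete dashboard: the single latency widget, the `ApiName` parameter and the `ApiDashboard` logical ID are assumptions. `Fn::Sub` fills in `${ApiName}` and the `${AWS::Region}` pseudo parameter when the stack is deployed.

```python
import json


def dashboard_template():
    """Build a CloudFormation template with a parameterized dashboard."""
    body = {
        "widgets": [
            {
                "type": "metric",
                "width": 12,
                "height": 6,
                "properties": {
                    "title": "Latency (p95)",
                    "region": "${AWS::Region}",
                    "stat": "p95",
                    "period": 60,
                    "metrics": [
                        ["AWS/ApiGateway", "Latency", "ApiName", "${ApiName}"]
                    ],
                },
            }
        ]
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {"ApiName": {"Type": "String"}},
        "Resources": {
            "ApiDashboard": {
                "Type": "AWS::CloudWatch::Dashboard",
                "Properties": {
                    "DashboardName": {"Fn::Sub": "${ApiName}-${AWS::Region}"},
                    # DashboardBody must be a JSON string; Fn::Sub replaces
                    # the ${...} placeholders inside it at deploy time.
                    "DashboardBody": {"Fn::Sub": json.dumps(body)},
                },
            }
        },
    }


template = dashboard_template()
print(json.dumps(template, indent=2))
```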

Designing Service Dashboards

As a rule of thumb, you should limit dashboards to only the most relevant and significant information about the health of a system. For APIs, consider including the following:

95th/99th percentile and max response times.
The number of 2XX, 4XX and 5XX responses.
The error rate, i.e. the percentage of requests that did not complete successfully.

A dashboard like this is simple and tells me the general health of the API at a glance. “Keeping it simple” is easily the most important advice for building effective dashboards. It’s also the most difficult to follow because the temptation is always to add more information to dashboards. As a result, they often end up cluttered, confusing to read, and slow to render as there are far too many data points on the screen. Here are a few tips for building service dashboards:

Use simple (boring) visualizations.
Use horizontal annotations to mark SLA thresholds, etc.
Use a consistent color scheme.
Put the most important metrics at the top to create a hierarchy. Also bear in mind that widgets below the fold are rarely seen.

This page has some simple guidelines for designing dashboards. Stephen Few’s Information Dashboard Design is also a great read if you want to dive deeper into data visualization with dashboards.

CloudWatch Metrics Alerting

Besides the per-function metrics, CloudWatch monitoring also reports a number of metrics that are aggregated across all functions.

While most of these aren’t very useful (given the lack of specificity), I strongly recommend that you set up an alert against the ConcurrentExecutions metric. Set the alert threshold to ~80% of the regional concurrency limit (defaults to 1000 in most regions). When you raise this soft limit via support, don’t forget to update the alert to reflect the new regional limit. For individual functions, consider adding the following alerts for each:

Error rate: use metric math to calculate the error rate (error count / invocation count). Alert when the error rate goes above, say, 1%.
Timeouts: as discussed earlier, CloudWatch does not publish a separate metric for timeout errors. Instead, you should create a metric filter to capture timeout messages as a custom metric and set an alert on it.
Iterator age: for stream-based functions, set an alert against the IteratorAge metric so you know when your function is drifting behind.
SQS message age: for SQS functions, set an alert against the ApproximateAgeOfOldestMessage metric on the queue. As this metric goes up, it signals that your SQS function is not keeping up with throughput.
DLQ errors: set an alert when the number of DLQ errors is greater than 0. This is usually a bad sign. The DLQ is your last chance to capture failed events before they’re lost. So if Lambda is not able to publish them to the DLQ then data is lost.
Throttling: we sometimes use reserved concurrency to limit the max concurrency of a function and throttling would be expected behaviour in those cases. But for functions that do not have a reserved concurrency, we should have alerts for when they’re throttled. This is especially true for user-facing API functions, where we cannot count on built-in retries and the throttling impacts user experience.
API latency: for APIs, especially user-facing APIs, you should set up alerts based on your SLA/SLO. For example, alert when the 95th percentile latency is over 3s for five consecutive minutes. This alerts you to degraded performance in the system. It’s possible to do this with Lambda duration too. But I find it better to alert with API Gateway’s Latency metric because it’s closer to an end-to-end metric. If the degraded performance is due to problems in API Gateway, you still want to be notified as it has user impact nonetheless.
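As an example of the error rate alert from the list above, the following sketch builds the arguments for a `PutMetricAlarm` call that uses metric math. The alarm name, the one-minute period and the five evaluation periods are assumptions. The 1% threshold follows the suggestion above.

```python
def build_error_rate_alarm(function_name, threshold_percent=1.0):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API.

    The alarm fires when errors / invocations stays above the threshold
    (in percent) for five consecutive one-minute periods.
    """

    def lambda_metric(metric_id, metric_name):
        # One input series for the expression; not alarmed on directly.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "FunctionName", "Value": function_name}
                    ],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate",
        "Metrics": [
            lambda_metric("errors", "Errors"),
            lambda_metric("invocations", "Invocations"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": 5,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        # Periods without invocations should not trigger the alarm.
        "TreatMissingData": "notBreaching",
    }


# "checkout-handler" is a hypothetical function name.
alarm = build_error_rate_alarm("checkout-handler")
# With boto3 installed and credentials configured, this could be created as:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```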

So that’s a lot of alerts we have to set up! Since most of them follow a certain convention, we should automate the process of creating them. The ACloudGuru team created a handy plugin for the Serverless framework. However, it still requires a lot of configuration, especially if you don’t agree with the plugin’s defaults. My preferred approach is to automatically create alerts with CloudFormation macros. If you want to learn more about CloudFormation macros and how to create them, check out this excellent post by Alex DeBrie.
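To illustrate the idea of convention-based alerts without going into CloudFormation macros, here is a minimal sketch in plain Python that generates alarm definitions for a list of functions. It is a stand-in for the macro approach, not an implementation of it. The alarm names, the function names and the default regional limit of 1000 are assumptions.

```python
def simple_alarm(name, metric_name, threshold, statistic="Sum",
                 function_name=None):
    """Build PutMetricAlarm arguments for a single AWS/Lambda metric."""
    params = {
        "AlarmName": name,
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,
        "Statistic": statistic,
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
    if function_name is not None:
        params["Dimensions"] = [
            {"Name": "FunctionName", "Value": function_name}
        ]
    return params


def conventional_alarms(function_names, regional_limit=1000):
    """Generate one account-level alarm plus two alarms per function."""
    alarms = [
        # No dimensions: the metric aggregated across all functions,
        # alarmed at 80% of the regional concurrency limit.
        simple_alarm("regional-concurrency", "ConcurrentExecutions",
                     0.8 * regional_limit, statistic="Maximum")
    ]
    for fn in function_names:
        alarms.append(
            simple_alarm(f"{fn}-throttles", "Throttles", 0, function_name=fn))
        alarms.append(
            simple_alarm(f"{fn}-dlq-errors", "DeadLetterErrors", 0,
                         function_name=fn))
    return alarms


# Hypothetical function names.
alarms = conventional_alarms(["checkout-handler", "send-email"])
for alarm in alarms:
    print(alarm["AlarmName"], alarm["MetricName"], alarm["Threshold"])
```

Functions that use reserved concurrency on purpose would need to be excluded from the throttling alarm, as noted in the list above.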

AWS Lambda CloudWatch Events

CloudWatch Events lets you receive a stream of events that indicate changes in any AWS service. A common use of CloudWatch Events is to trigger automated actions that can help resolve a problem in production. CloudWatch Events are typically not used to diagnose or fix issues in AWS Lambda. Rather, they can be used to trigger Lambda functions to solve issues in other AWS services. For example, you can invoke a Lambda function every time the state of an EC2 instance changes, and use it to perform logging or maintenance activities.
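A handler for the EC2 example could look like the following minimal sketch. It assumes a CloudWatch Events rule that forwards "EC2 Instance State-change Notification" events to the function. The sample event is trimmed and its values are made up.

```python
import json


def handler(event, context):
    """Log EC2 state changes delivered by a CloudWatch Events rule."""
    detail = event.get("detail", {})
    record = {
        "instance_id": detail.get("instance-id"),
        "state": detail.get("state"),
        "time": event.get("time"),
    }
    # Anything written to stdout ends up in CloudWatch Logs.
    print(json.dumps(record))
    return record


# Trimmed, made-up sample of an EC2 state-change event.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "time": "2020-01-19T12:00:00Z",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
result = handler(sample_event, None)
```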

Summary

In this post, we took a deep dive into how you can use CloudWatch Metrics to monitor your Lambda functions. We looked at the metrics that you get out-of-the-box, and how to publish custom metrics. We explored some of the limitations with CloudWatch Metrics. We saw what you can do with dashboards in CloudWatch and discussed some tips for designing a service dashboard. Finally, we discussed what alerts you should set up so that you are duly notified when things go wrong. In my next post, we will take a deep dive into CloudWatch Logs to see how you can use it to help debug issues and the limits with CloudWatch Logs. See you next time! And, of course, don’t hesitate to get in touch if you have any questions about this article or CloudWatch Metrics in general.