With AWS Lambda, you have basic serverless monitoring built into the platform with CloudWatch. CloudWatch is an AWS monitoring platform that supports both metrics and logging, and its API provides programmatic access to metrics. CloudWatch monitoring gives you basic metrics, visualization, and alerting, while CloudWatch Logs captures everything that is written to stdout and stderr. In this post, we will take a deep dive into CloudWatch Metrics to see how you can use them for AWS Lambda monitoring, and look at the limitations of the built-in AWS performance monitoring tools.
This is part of a series of articles about AWS Lambda monitoring.
Let’s review the key Lambda metrics you can track using AWS monitoring tools.
The errors metric in Lambda counts two types of issues:
CloudWatch gives you the number of function calls that resulted in an error. To fix the errors, you’ll need to check Lambda logs to diagnose the problem.
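As a rough illustration of reading the error count programmatically, the sketch below builds a request for the CloudWatch `GetMetricStatistics` API and sums the returned datapoints. The function name `checkout`, the one-hour window, and the helper names are assumptions made for the example; the actual call needs `boto3` and AWS credentials, so it is isolated in `fetch_error_count`.

```python
from datetime import datetime, timedelta, timezone


def build_error_query(function_name, hours=1, period_seconds=300, now=None):
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Sum"],
    }


def total_errors(datapoints):
    """Add up the Sum statistic across the returned datapoints."""
    return sum(point["Sum"] for point in datapoints)


def fetch_error_count(function_name):
    """Query CloudWatch (assumes boto3 is installed and credentials are set)."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("cloudwatch")
    response = client.get_metric_statistics(**build_error_query(function_name))
    return total_errors(response["Datapoints"])


# Hypothetical usage: fetch_error_count("checkout")
```

A non-zero total only tells you that something failed; the logs are still where you find out why.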
Services like S3 or SNS invoke Lambda asynchronously. If an event still fails after Lambda’s automatic retries, it is sent to a “dead letter queue”, provided you have configured one. A dead letter error signifies that there was an issue sending the event to that dead letter queue. These errors are critical because they can lead to data loss. They can be caused by incorrect permissions, incorrect resource configuration, or size limits.
Function duration – the time each Lambda function takes to run – is an important metric because it can affect many aspects of your serverless application:
If a function actually times out, CloudWatch doesn’t log its duration, so it’s important to identify functions whose duration is close to the timeout. One of the following conditions can increase the duration of the function:
In AWS Lambda, timeouts are a severe issue that can hurt the serverless application’s performance. When function duration reaches the preset timeout, the function immediately stops running. CloudWatch does not report timeouts as a separate metric; they are counted together with other generic errors. To identify timeouts, you’ll need to create a custom alert in CloudWatch or use a serverless monitoring tool like Lumigo. If you experience timeouts, use the following process to resolve the issue:
When a function is invoked, whether it completes successfully or ends in a function error, CloudWatch records that an invocation occurred. This metric does not count throttled invocations. The significance of this metric is that AWS uses it in billing calculations, so a major increase in the number of invocations can dramatically increase your AWS Lambda costs. If you notice a spike in the number of invocations, check the following:
This metric is primarily used for streaming data. Iterator Age is defined as the time it takes for the latest record in a stream to reach Lambda and be processed by a function. If the metric increases, it means that a backlog of events is building up, and the serverless application is not keeping up. To resolve an iterator age issue, consider increasing concurrency, or making sure events are not too large or complex for your functions to handle in a timely manner.
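One way to catch a growing backlog early is an alarm on the `IteratorAge` metric. The sketch below builds the parameters for CloudWatch's `PutMetricAlarm` API; the one-minute threshold, the five evaluation periods, and the function name `stream-consumer` are illustrative assumptions, not recommended values.

```python
def iterator_age_alarm(function_name, max_age_ms=60_000, sns_topic_arn=None):
    """Build PutMetricAlarm parameters for a stream-processing function.

    IteratorAge is reported in milliseconds. The threshold and evaluation
    window are assumptions; tune them to your own stream.
    """
    alarm = {
        "AlarmName": f"{function_name}-iterator-age",
        "Namespace": "AWS/Lambda",
        "MetricName": "IteratorAge",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": float(max_age_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        # Notify an SNS topic when the alarm fires (the ARN is supplied by the caller).
        alarm["AlarmActions"] = [sns_topic_arn]
    return alarm


# Hypothetical usage, assuming boto3 and AWS credentials are available:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**iterator_age_alarm("stream-consumer"))
```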
A number of valuable metrics are sadly missing from CloudWatch monitoring, including:
Another problem with CloudWatch Metrics is that its percentile metrics for Lambda don’t work consistently. When it comes to monitoring latencies, you should use percentiles instead of the average. However, when a function experiences more than ~100 invocations per minute, the percentile latencies stop working! This is a critical issue that we have raised with AWS, and hopefully, it will be addressed in the near future. In the meantime, you can fall back to using a combination of average and max duration. For APIs, you can also use API Gateway’s Latency and IntegrationLatency metrics instead.

Update 19/01/2020: the issue with Lambda’s percentile latency metrics has been fixed as of this release on Nov 26, 2019.

Learn more about Lambda monitoring in our guide: Lambda Logs: a Complete Guide
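To see why percentiles matter more than the average for latency, consider the small worked example below. The duration samples are made up for illustration, and `percentile` is a simple nearest-rank helper written for this sketch, not the method CloudWatch uses internally.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up sample: 95 fast invocations and 5 slow ones (milliseconds).
durations_ms = [100] * 95 + [3000] * 5

summary = {
    "average": mean(durations_ms),
    "p50": percentile(durations_ms, 50),
    "p99": percentile(durations_ms, 99),
    "max": max(durations_ms),
}
print(summary)
```

In this sample the average sits well below the slow requests, while p99 exposes them directly. The average-plus-max fallback gives a rough version of the same picture: the max shows that slow invocations exist, and the average hints at how common they are.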
You can also set up dashboards in CloudWatch at a cost of $3 per month per dashboard (first 3 are free). CloudWatch supports a variety of widget types, and you can even include query results from CloudWatch Logs Insights.
You can compose your dashboards with any metrics from CloudWatch (including custom metrics). For example, the following dashboard is composed of several API Gateway metrics and highlights the health and performance of an API.
You can also use Metric Math to create computed metrics and include them in your dashboards. For example, the Status Codes widget below uses Metric Math to calculate the number of 2XX responses which is not available as a metric.
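A Metric Math widget of this kind might look roughly like the sketch below, which builds the widget definition as a Python dictionary and serializes it to dashboard JSON. It assumes the 2XX count is derived as the total `Count` minus `4XXError` and `5XXError`; the API name `my-api`, the layout values, and the metric ids are placeholders.

```python
import json


def status_codes_widget(api_name, region):
    """Build a dashboard widget that derives 2XX responses with Metric Math."""
    namespace = "AWS/ApiGateway"
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "Status Codes",
            "region": region,
            "view": "timeSeries",
            "stat": "Sum",
            "period": 60,
            "metrics": [
                # Computed series: total requests minus client and server errors.
                [{"expression": "m1 - m2 - m3", "label": "2XX", "id": "e1"}],
                # The raw request count feeds the expression but is hidden on the graph.
                [namespace, "Count", "ApiName", api_name, {"id": "m1", "visible": False}],
                [namespace, "4XXError", "ApiName", api_name, {"id": "m2", "label": "4XX"}],
                [namespace, "5XXError", "ApiName", api_name, {"id": "m3", "label": "5XX"}],
            ],
        },
    }


dashboard_body = json.dumps({"widgets": [status_codes_widget("my-api", "us-east-1")]})
```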
Once you have handcrafted your dashboard, you can click Actions, View/edit source to see the JSON source behind the dashboard.
You can then codify the dashboard as an AWS::CloudWatch::Dashboard resource in a CloudFormation template. You will have to parameterize some of the fields such as API name and region so that the template can be used for different stages and regions.
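As a minimal sketch of what that parameterization could look like, the snippet below generates a CloudFormation template containing an `AWS::CloudWatch::Dashboard` resource. The dashboard body is passed through `Fn::Sub` so that `${ApiName}` and `${AWS::Region}` are filled in at deploy time. The resource name `ApiDashboard`, the single latency widget, and the parameter name are assumptions for illustration.

```python
import json


def dashboard_template():
    """Return a CloudFormation template (as a dict) for a parameterized dashboard."""
    body = {
        "widgets": [
            {
                "type": "metric",
                "x": 0, "y": 0, "width": 12, "height": 6,
                "properties": {
                    "title": "Latency",
                    # Placeholders below are resolved by Fn::Sub at deploy time.
                    "region": "${AWS::Region}",
                    "period": 60,
                    "metrics": [
                        ["AWS/ApiGateway", "Latency", "ApiName", "${ApiName}", {"stat": "p99"}]
                    ],
                },
            }
        ]
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {"ApiName": {"Type": "String"}},
        "Resources": {
            "ApiDashboard": {
                "Type": "AWS::CloudWatch::Dashboard",
                "Properties": {
                    "DashboardName": {"Fn::Sub": "${ApiName}-${AWS::Region}"},
                    # DashboardBody must be a JSON string, not a nested object.
                    "DashboardBody": {"Fn::Sub": json.dumps(body)},
                },
            }
        },
    }


print(json.dumps(dashboard_template(), indent=2))
```

Deploying the same template with a different `ApiName` value per stage then gives each stage its own dashboard.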
As a rule of thumb, you should limit dashboards to only the most relevant and significant information about the health of a system. For APIs, consider including the following:
It’s simple and tells me the general health of the API at a glance. “Keeping it simple” is easily the most important advice for building effective dashboards. It’s also the most difficult to follow because the temptation is always to add more information to dashboards. As a result, they often end up cluttered, confusing to read and slow to render as there are far too many data points on the screen. Here are a few tips for building service dashboards:
This page has some simple guidelines for designing dashboards. Stephen Few’s Information Dashboard Design is also a great read if you want to dive deeper into data visualization with dashboards.
Besides the per-function metrics, CloudWatch monitoring also reports a number of metrics that are aggregated across all functions:
While most of these aren’t very useful (given the lack of specificity), I strongly recommend that you set up an alert against the ConcurrentExecutions metric. Set the alert threshold to ~80% of the regional concurrency limit (defaults to 1000 in most regions). When you raise this soft limit via support, don’t forget to update the alert to reflect the new regional limit. For individual functions, consider adding the following alerts for each:
So that’s a lot of alerts we have to set up! Since most of them follow a certain convention, we should automate the process of creating them. The ACloudGuru team created a handy plugin for the Serverless framework. However, it still requires a lot of configuration, especially if you don’t agree with the plugin’s defaults. My preferred approach is to automatically create alerts with CloudFormation macros. If you want to learn more about CloudFormation macros and how to create them, check out this excellent post by Alex DeBrie.
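To illustrate the idea of convention-driven alerts, here is a small sketch in plain Python rather than a CloudFormation macro. It generates `PutMetricAlarm` parameters for a region-wide `ConcurrentExecutions` alarm at 80% of the limit, plus one alarm per function for each metric in a list. The choice of `Errors` and `Throttles`, the thresholds, and the naming scheme are assumptions for the example.

```python
def concurrency_alarm(regional_limit=1000, fraction=0.8):
    """Region-wide alarm at a fraction of the regional concurrency limit."""
    return {
        "AlarmName": "lambda-concurrent-executions",
        "Namespace": "AWS/Lambda",
        "MetricName": "ConcurrentExecutions",  # no dimensions: aggregated across functions
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": regional_limit * fraction,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


def function_alarms(function_name, metrics=("Errors", "Throttles")):
    """One alarm per metric for a single function (the metric list is an assumption)."""
    return [
        {
            "AlarmName": f"{function_name}-{metric.lower()}",
            "Namespace": "AWS/Lambda",
            "MetricName": metric,
            "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
            "Statistic": "Sum",
            "Period": 60,
            "EvaluationPeriods": 1,
            "Threshold": 1.0,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "TreatMissingData": "notBreaching",
        }
        for metric in metrics
    ]


def all_alarms(function_names, regional_limit=1000):
    """Collect the regional alarm plus the per-function alarms."""
    alarms = [concurrency_alarm(regional_limit)]
    for name in function_names:
        alarms.extend(function_alarms(name))
    return alarms


# Hypothetical usage, assuming boto3 and AWS credentials are available:
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   for alarm in all_alarms(["checkout", "search"]):
#       cloudwatch.put_metric_alarm(**alarm)
```

If the regional limit is raised, passing the new value as `regional_limit` keeps the threshold at 80% of it.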
CloudWatch Events lets you receive a stream of events that indicate changes in any AWS service. A common use of CloudWatch Events is to trigger automated actions that can help resolve a problem in production. CloudWatch Events are typically not used to diagnose or fix issues in AWS Lambda. Rather, they can be used to trigger Lambda functions to solve issues in other AWS services. For example, you can invoke a Lambda function every time the state of an EC2 instance changes, and use it to perform logging or maintenance activities.
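The EC2 example could be sketched as the Lambda handler below, which logs each state change it receives. It assumes a CloudWatch Events rule already routes "EC2 Instance State-change Notification" events to the function; the shape of the returned record is a choice made for this example.

```python
import json


def handler(event, context):
    """Log EC2 instance state changes delivered by CloudWatch Events."""
    if event.get("detail-type") != "EC2 Instance State-change Notification":
        # Ignore anything the rule was not meant to deliver.
        return {"handled": False}

    detail = event["detail"]
    record = {
        "instance_id": detail["instance-id"],
        "state": detail["state"],
        "time": event.get("time"),
    }
    # Anything printed to stdout ends up in CloudWatch Logs.
    print(json.dumps(record))
    return {"handled": True, **record}


# Local check with a hand-written sample event (the values are made up).
sample_event = {
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "time": "2020-01-19T12:00:00Z",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
result = handler(sample_event, None)
```

From here, the maintenance step (for example, tagging or notifying an owner) would replace or follow the `print` call.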
In this post, we took a deep dive into how you can use CloudWatch Metrics to monitor your Lambda functions. We looked at the metrics that you get out-of-the-box, and how to publish custom metrics. We explored some of the limitations of CloudWatch Metrics. We saw what you can do with dashboards in CloudWatch and discussed some tips for designing a service dashboard. Finally, we discussed what alerts you should set up so that you are duly notified when things go wrong. In my next post, we will take a deep dive into CloudWatch Logs to see how you can use it to help debug issues, and where its limits lie. See you next time! And, of course, don’t hesitate to get in touch if you have any questions about this article or CloudWatch Metrics in general.