With AWS Lambda, you have basic observability built into the platform with CloudWatch. CloudWatch offers support for both metrics and logging.
CloudWatch Metrics gives you basic metrics, visualization and alerting while CloudWatch Logs captures everything that is written to stdout and stderr. In this post, we will take a deep dive into CloudWatch Metrics to see how you can use it to monitor your Lambda functions and its limitations.
You get all the basic telemetry about the health of a function out of the box:
In addition, you also have some metrics that are only relevant to specific event sources:
In addition to these built-in metrics, you can also record custom metrics and publish them to CloudWatch Metrics. You can do this in a number of ways, including:
A number of valuable metrics are sadly missing, including:
Another problem with CloudWatch Metrics is that its percentile metrics for Lambda doesn’t work consistently. When it comes to monitoring latencies, you should be using percentiles instead of the average. However, when a function experiences more than ~100 invocations per minute, the percentile latencies stop working! This is a critical issue that we have raised with AWS, and hopefully, it will be addressed in the near future. In the meantime, you can fall back to using a combination of average and max duration. For APIs, you can also use API Gateway’s Latency and IntegrationLatency metrics instead.
update 19/01/2020: the issue with Lambda’s percentile latency metrics has been fixed as of this release on Nov 26, 2019.
You can also set up dashboards in CloudWatch at a cost of $3 per month per dashboard (first 3 are free). CloudWatch supports a variety of widget types, and you can even include query results from CloudWatch Logs Insights.
You can compose your dashboards with any metrics from CloudWatch (including custom metrics). For example, the following dashboard is composed of several API Gateway metrics and highlights the health and performance of an API.
You can also use Metric Math to create computed metrics and include them in your dashboards. For example, the Status Codes widget below uses Metric Math to calculate the number of 2XX responses which is not available as a metric.
Once you have handcrafted your dashboard. You can click Actions, View/edit source to see the code behind for the dashboard.
You can then codify the dashboard as an AWS::CloudWatch::Dashboard resource in a CloudFormation template. You will have to parameterize some of the fields such as API name and region so that the template can be used for different stages and regions.
As a rule of thumb, you should limit dashboards to only the most relevant and significant information about the health of a system. For APIs, consider including the following:
It’s simple and tells me the general health of the API at a glance.
“Keeping it simple” is easily the most important advice for building effective dashboards. It’s also the most difficult to follow because the temptation is always to add more information to dashboards. As a result, they often end up cluttered, confusing to read and slow to render as there are far too many data points on the screen.
Here are a few tips for building service dashboards:
Besides the per-function metrics, CloudWatch also reports a number of metrics that are aggregated across all functions:
While most of these aren’t very useful (given the lack of specificity), I strongly recommend that you set up an alert against the ConcurrentExecutions metric. Set the alert threshold to ~80% of the regional concurrency limit (defaults to 1000 in most regions). When you raise this soft limit via support, don’t forget to update the alert to reflect the new regional limit.
For individual functions, consider adding the following alerts for each:
So that’s a lot of alerts we have to set up! Since most of them follow a certain convention, we should automate the process of creating them. The ACloudGuru team created a handy plugin for the Serverless framework. However, it still requires a lot of configuration, especially if you don’t agree with the plugin’s defaults.
My preferred approach is to automatically create alerts CloudFormation macros. If you want to learn more about CloudFormation macros and how to create them, check out this excellent post by Alex Debrie.
In this post, we took a deep dive into how you can use CloudWatch Metrics to monitor your Lambda functions.
We looked at the metrics that you get out-of-the-box, and how to publish custom metrics. We explored some of the limitations with CloudWatch Metrics. We saw what you can do with dashboards in CloudWatch and discussed some tips for designing a service dashboard. Finally, we discussed what alerts you should set up so that you are duly notified when things go wrong.
In my next post, we will take a deep dive into CloudWatch Logs to see how you can use it to help debug issues and the limits with CloudWatch Logs.
See you next time! And, of course, don’t hesitate to get in touch if you have any questions about this article or CloudWatch Metrics in general.