You now have deeper insights into Lambda ESM with these new metrics

Lambda’s Event-Source Mapping (ESM) has been a game-changer. It gives you an easy and cost-efficient way to process events from Amazon SQS, Amazon Kinesis, Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon MQ, and more.

It handles all the complexities around polling, including scaling the number of pollers. And there’s no charge for this invisible layer of infrastructure!

But one downside of ESM is that it’s a black box; we don’t have visibility into its inner workings. How many events did it receive? Were there any problems with Lambda invocations?

We can infer some of these through other service metrics. For example, we can use SQS’s NumberOfMessagesReceived metric to infer the number of messages the ESM should have received.

But this doesn’t account for problems between SQS and ESM, such as networking or permission issues that affect ESM’s ability to poll from the queues.

Similarly, if a function does not have sufficient concurrency to process events, ESM will retry the throttled invocations automatically. But without visibility, it’s possible to lose data if events are not processed before they expire from the stream or queue.

It’s good to see AWS address these gaps with the new ESM metrics. Today, AWS Lambda released a set of ESM metrics for SQS, Kinesis and DynamoDB streams.

Here are all the new metrics.

Common ESM metrics

There are 3 common metrics for all supported ESM sources:

  • PolledEventCount: The number of events ESM successfully polled from the source. Monitor this to catch problems related to polling, e.g. permission issues or a misconfigured event source.
  • InvokedEventCount: The number of events ESM sent to the target Lambda function, both successes and failures. In case of errors, this metric can be higher than PolledEventCount due to retries.
  • FilteredOutEventCount: The number of events that ESM filtered out when you have configured filter criteria.
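
If you want to eyeball these counts outside of the console, you can pull them with the CloudWatch API. Here’s a minimal boto3 sketch, assuming the metrics land in the AWS/Lambda namespace with an EventSourceMappingUUID dimension (which is how the console appears to group them; check the metric’s actual dimensions in CloudWatch, and note the UUID below is a placeholder):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

def esm_metric_sum(metric_name, esm_uuid):
    # Sum a per-ESM metric over the last hour.
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",  # assumed namespace for the new ESM metrics
        MetricName=metric_name,
        Dimensions=[{"Name": "EventSourceMappingUUID", "Value": esm_uuid}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])

esm_uuid = "00000000-0000-0000-0000-000000000000"  # placeholder ESM UUID
polled = esm_metric_sum("PolledEventCount", esm_uuid)
invoked = esm_metric_sum("InvokedEventCount", esm_uuid)
filtered = esm_metric_sum("FilteredOutEventCount", esm_uuid)
print(f"polled={polled} invoked={invoked} filtered={filtered}")

Barring retries, you’d expect polled to roughly equal invoked plus filtered out; a persistent gap between them is worth investigating.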

SQS ESM metrics

In addition to the common ESM metrics above, SQS ESMs also have these metrics.

  • DeletedEventCount: The number of messages that ESM deletes from SQS upon successful processing. Monitor this to catch problems related to SQS delete permissions.
  • FailedInvokeEventCount: The number of SQS messages that were not processed successfully. If ReportBatchItemFailures is used, then this is the number of messages in the BatchItemFailures array (see the sketch below). Otherwise, it’s the whole batch.
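
For context, here’s what the ReportBatchItemFailures contract looks like from the function’s side. A minimal Python sketch; process_message is a hypothetical stand-in for your business logic:

import json

def process_message(body):
    # Hypothetical stand-in for your business logic.
    print(body)

def handler(event, context):
    # Report only the failed messages back to ESM. The rest of the
    # batch is deleted from the queue (and counted in DeletedEventCount);
    # the reported failures are counted in FailedInvokeEventCount.
    failures = []
    for record in event["Records"]:
        try:
            process_message(json.loads(record["body"]))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

Note that you also need to enable ReportBatchItemFailures on the ESM itself (via its FunctionResponseTypes setting); otherwise the whole batch is retried on error.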

Kinesis/DynamoDB ESM metrics

In addition to the common ESM metrics above, Kinesis and DynamoDB ESMs also have these metrics.

  • DroppedEventCount: The number of records dropped by ESM due to expiry (when MaximumRecordAgeInSeconds is configured) or max retries being exhausted (see the alarm sketch after this list).
  • OnFailureDestinationDeliveredEventCount: The number of records ESM successfully sent to the OnFailure Destination (when one is configured). This should match the DroppedEventCount. If they diverge, then there’s a problem with sending filtered/failed events to the OnFailure Destination. However, these delivery failures should be covered by the DestinationDeliveryFailures metric already, so I’m not sure how useful this new metric is.
  • FailedInvokeEventCount: This is similar to the FailedInvokeEventCount metric for SQS ESM but with its own… erm… unique twists… If you use ReportBatchItemFailures, then this counts “records in the batch between the lowest sequence number returned in BatchItemFailures and the largest sequence number in the batch”. It’s not you, I don’t get it either… If you use BisectBatchOnFunctionError, then this metric counts the entire batch upon function failure. It feels like some implementation detail that we’re not privy to has leaked into this metric. I think it can be useful as a soft signal that something’s wrong, but not as a precise measure of how many events are affected.
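
Of these, DroppedEventCount is the clearest signal of actual data loss, so it’s the first one I’d alarm on. Here’s a minimal boto3 sketch, with the same assumed AWS/Lambda namespace and EventSourceMappingUUID dimension as before (the UUID and SNS topic ARN are placeholders):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm as soon as any record is dropped (i.e. lost) by the ESM.
cloudwatch.put_metric_alarm(
    AlarmName="esm-dropped-events",
    Namespace="AWS/Lambda",  # assumed namespace for the new ESM metrics
    MetricName="DroppedEventCount",
    Dimensions=[{
        "Name": "EventSourceMappingUUID",
        "Value": "00000000-0000-0000-0000-000000000000",  # placeholder
    }],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder
)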

Cost

These new metrics are charged at CloudWatch’s standard rate of $0.30 per metric per month. However, there is no charge for the PutMetricData API calls.
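
As a back-of-the-envelope example: an SQS ESM emits five of these metrics (the three common ones plus two SQS-specific ones), so enabling them should cost around 5 × $0.30 = $1.50 per ESM per month, assuming all five are published. A Kinesis or DynamoDB ESM emits six, so around $1.80.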

Get started

You can enable these ESM metrics on a per-ESM basis through CloudFormation, the AWS SDKs, the AWS CLI, and the AWS console.

CloudFormation

You can enable the new metrics via a new MetricsConfig property for the AWS::Lambda::EventSourceMapping resource type.

Type: AWS::Lambda::EventSourceMapping
Properties:
  # ... usual ESM properties
  MetricsConfig:
    Metrics:
      - EventCount

The Metrics property nested in MetricsConfig is typed as an array for future extensibility. The only allowed value today is EventCount, and it enables the new metrics mentioned above, based on the type of ESM.
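
The same setting is available programmatically. Here’s a minimal sketch using boto3, assuming a recent SDK version that includes the MetricsConfig parameter on UpdateEventSourceMapping (the UUID is a placeholder):

import boto3

lambda_client = boto3.client("lambda")

# Turn on the new ESM metrics for an existing event source mapping.
lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",  # placeholder ESM UUID
    MetricsConfig={"Metrics": ["EventCount"]},
)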

AWS Console

1. Go to the Lambda function’s page, select the ESM you want to enable, and click “Edit”.

2. Check “Enable metrics”.

The challenge with these new metrics is that they are grouped by ESM ID. If you have multiple ESMs, it’s hard to figure out which is which at a glance.

The easiest way to find the ESM UUID is through the Lambda console.
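
If you’d rather script it, you can map each UUID to its event source with the SDK. A quick boto3 sketch (“my-function” is a placeholder function name):

import boto3

lambda_client = boto3.client("lambda")

# Print each ESM's UUID alongside its event source ARN, so you can
# tell the per-UUID metrics apart.
paginator = lambda_client.get_paginator("list_event_source_mappings")
for page in paginator.paginate(FunctionName="my-function"):
    for esm in page["EventSourceMappings"]:
        print(esm["UUID"], esm.get("EventSourceArn"))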

Creating alerts in Lumigo

Once you have enabled the metrics, you can easily create alerts in Lumigo.

1. Select “Metrics” and then “CloudWatch Metrics”.

2. Add a description for the alarm and set the region and metric category.

3. Choose the metric, the ESM UUID, the statistic (average in this case), and the threshold for the alarm.

4. And finally, choose how you’d like to be notified.

Lumigo integrates with a wide range of platforms, so you can get notified in a number of different ways.

Summary

I love using managed capabilities like ESM. They let us do more with less and focus on the things that matter to our customers.

But when things go wrong, it can feel like we are flying blind.

The introduction of these new ESM metrics provides much-needed visibility into the inner workings of ESM. They help us monitor performance and troubleshoot issues.

While there is a nominal cost associated with these metrics, the benefits of enhanced monitoring and the potential to prevent significant issues make them a valuable addition to your serverless toolkit.

I encourage you to enable these metrics for your critical event sources and use Lumigo to alert you on your preferred channel.
