Guide Content

Serverless Monitoring Challenges and Achieving Observability

Monitoring Serverless Applications Effortlessly

Serverless monitoring is a key element in ensuring application reliability and security. A good monitoring system alerts you about errors in serverless applications before they ever affect your customers, allowing you to quickly issue fixes and maintain a high level of value delivery for your application’s users.

Monitoring for serverless applications is challenging due to the distributed nature of the serverless architecture. This guide is designed to give you an overview of the challenges faced when setting up serverless monitoring and alerting. We’ll explore what tools are available, and how to create a bulletproof serverless application that your users will love.

This is part of an extensive series of guides about Observability.

This is part of our comprehensive guide to performance testing in a cloud native world.

Challenges of Serverless Monitoring

In a traditional web application, you have full ownership of, and full visibility into, the entire application stack. A call to your backend server will always have roughly the same overhead. Your synchronous web calls will always execute in the same general sequence. You can quickly identify bottlenecks as each element of the stack behaves in a predictable fashion.

In serverless applications, observability is more challenging because your infrastructure is almost entirely ephemeral:

While your main content servers may remain in an active state, the serverless functions containing your application’s backend might be re-instantiated on every call to your application.
The stateless nature of serverless functions also introduces challenges, as you no longer maintain your application in terms of discrete multi-event transactions.
Timing becomes unpredictable, as you incur additional overhead for each call to a function that has been idle for any length of time.
The shift from paying for resource availability to only paying for the resources your application uses makes your infrastructure costs less predictable.
Your functions operate entirely independently, so traditional monitoring tools tend to cost more in a serverless application. This can result in issues like incomplete tracing for exceptions and additional performance hits from remote metric tracking systems.

Monitoring and logging remain essential for gauging application health, but in an environment where every request is likely to go to an external machine, the cost of monitoring itself is likely to be higher.

Learn more in our detailed guide to serverless observability.

5 Things that Can Go Wrong in a Serverless Application

Here are a few of the common monitoring challenges that arise in a serverless application.

1. Cold Starts

Most serverless function providers implement a hot-cold architecture. Basically, the more frequently a serverless function is called, the more available it will be for future calls. Functions that are called frequently in this manner are referred to as “hot” functions.

When a function is idle for any length of time, though, you run the risk of the serverless provider reclaiming the resources used to make your function available. The next time one of these functions is called, the serverless provider needs to spin up associated resources to complete your application’s request. This is known as a “cold” start.

While an individual cold start doesn’t incur too much overhead, normally on the order of 100 milliseconds, enough cold starts strung together can significantly affect the user experience. For example, a low-traffic web page that makes ten serverless function calls in sequence can incur up to a full second of additional wait time (10 × 100 ms) for cold starts.
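
One way to see how often cold starts actually happen is to flag them from inside the function. The following is a minimal Python sketch, not a definitive implementation. It assumes a runtime that keeps module-level state alive between invocations of a warm instance, as AWS Lambda does. The names `_is_cold` and `handler` and the log field names are illustrative assumptions.

```python
import json
import time

# Module-level code runs once per function instance, so this flag is True
# only for the first invocation handled by a freshly started instance.
_is_cold = True
_instance_started = time.time()


def handler(event, context):
    """Hypothetical handler that reports whether this call was a cold start."""
    global _is_cold
    cold_start = _is_cold
    _is_cold = False  # every later call on this instance is warm

    # Emit a structured log line that a log-based metric or alert can count.
    print(json.dumps({
        "cold_start": cold_start,
        "instance_age_s": round(time.time() - _instance_started, 3),
    }))
    return {"cold_start": cold_start}
```

Counting the log lines where `cold_start` is true against total invocations gives a rough cold start percentage per function.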

2. Memory Usage

Depending on the provider you choose, you may have very limited choices when it comes to managing the run-time memory of your serverless functions. This can have unexpected effects in your application’s resource usage.

One example is with AWS Lambda functions. During configuration of a Lambda function, you often specify the amount of RAM that should be allocated to your function as it runs.

What is often not clearly stated is that this choice also determines the processing power allocated to your serverless function, with larger RAM requests resulting in more powerful processor allocations. Given that processing power is a factor in determining your resource usage, this can increase the resource usage of your serverless application, along with the usage bill that goes with it.
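
To make the effect of the memory setting concrete, here is a rough sketch of how billed compute scales with allocated memory. It assumes the common model in which compute is billed in GB-seconds (allocated memory multiplied by execution time). The price constant is an illustrative assumption, not a quoted rate, so check your provider’s current pricing before relying on the numbers.

```python
def gb_seconds(memory_mb: int, duration_ms: float, invocations: int) -> float:
    """Billed compute: allocated memory (GB) x run time (s) x number of calls."""
    return (memory_mb / 1024) * (duration_ms / 1000) * invocations


# ASSUMPTION: illustrative price per GB-second, not an official rate.
PRICE_PER_GB_SECOND = 0.0000166667


def monthly_compute_cost(memory_mb, duration_ms, invocations,
                         price=PRICE_PER_GB_SECOND):
    """Estimated compute cost for a month of invocations at one memory setting."""
    return gb_seconds(memory_mb, duration_ms, invocations) * price


if __name__ == "__main__":
    # Same workload at three memory settings, holding duration constant.
    for memory_mb in (128, 512, 1024):
        cost = monthly_compute_cost(memory_mb, duration_ms=200,
                                    invocations=1_000_000)
        print(f"{memory_mb:>5} MB -> ${cost:.2f}")
```

This comparison holds duration constant. In practice a larger allocation often shortens execution time, so it is worth measuring duration at each setting rather than assuming the cost grows linearly.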

Read more on memory usage:

Best Practices for Working With AWS Lambda Functions

3. Concurrency Limitations

The promise of a serverless architecture is that your functions are only available when they are needed, allowing you to save money on resources by not paying for unnecessary availability.

What happens, though, if your application begins to scale? Many serverless function providers include a concurrency limit in execution. If your application’s activity causes your functions to exceed this concurrency limit, then unpredictable behavior may occur.

Concurrency limitations can manifest as longer execution times (while a request waits for an available machine to execute the function), server errors from the provider, or other failures of execution that can severely impact the user experience. As such, it is important to plan around these concurrency limitations and be aware of when you are approaching thresholds defined by your serverless provider.
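
A quick way to judge how close you are to a concurrency threshold is to estimate concurrency as the request rate multiplied by the average execution time. The sketch below applies that estimate. The function names, the 80% warning level, and the limit of 1,000 used in the usage example are assumptions for illustration; use the limit that applies to your own account.

```python
def estimated_concurrency(requests_per_second: float, avg_duration_s: float) -> float:
    """Approximate number of function instances busy at the same time."""
    return requests_per_second * avg_duration_s


def concurrency_status(requests_per_second: float, avg_duration_s: float,
                       limit: int, warn_at: float = 0.8) -> str:
    """Classify current load against a concurrency limit.

    Returns "over_limit" when the estimate exceeds the limit, "warning" when
    it reaches `warn_at` (a fraction of the limit), and "ok" otherwise.
    """
    used = estimated_concurrency(requests_per_second, avg_duration_s)
    if used > limit:
        return "over_limit"
    if used >= warn_at * limit:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # 1,800 requests/s at 0.5 s each keeps roughly 900 instances busy.
    print(concurrency_status(1800, 0.5, limit=1000))
```

Alerting on the "warning" state rather than waiting for throttling gives you time to request a higher limit or reduce execution time.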

4. Resource Availability Limitations

In a traditional web application, your resource availability is easily discoverable and often well-known by your application maintainers.

In a serverless architecture it is more difficult to identify these limitations. Given that serverless applications rely upon on-demand architecture, you can often run into cases where a function simply fails to respond. This can be due to a temporary issue on the provider, a bug in your code that is causing silent failures, or any of a number of potential reasons in-between.

To protect against non-responding resources, you’ll need more than defensive coding that ensures graceful degradation of the user experience. You will also need additional monitoring to catch these scenarios when they happen. Monitoring characteristics like this will help you identify patterns in your application’s behavior, allowing you to potentially predict failures before they happen (as well as respond more quickly when they do occur).
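
The combination of defensive coding and monitoring can be sketched as a small wrapper that enforces a timeout, returns a fallback value, and logs every degraded call so it shows up in your monitoring. This is a minimal illustration using only the Python standard library. The name `call_with_fallback` and the logger name are hypothetical.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

logger = logging.getLogger("resilience")  # hypothetical logger name


def call_with_fallback(fn, fallback, timeout_s=2.0):
    """Run fn(); on timeout or error, log the failure and return fallback."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The call did not answer in time: record it so alerts can fire.
        logger.warning("call timed out after %.2fs, using fallback", timeout_s)
        return fallback
    except Exception:
        # The call failed outright: keep the stack trace in the logs.
        logger.exception("call failed, using fallback")
        return fallback
    finally:
        # Do not block on a stuck call; the worker thread is abandoned,
        # not cancelled, and finishes on its own.
        executor.shutdown(wait=False)
```

For example, `call_with_fallback(fetch_recommendations, fallback=[])` would return an empty list when the hypothetical `fetch_recommendations` call hangs, while the warning log entry gives your alerting something to count.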

Read more on resource availability limitations:

AWS Lambda Limits

5. Cost

Generally, you are charged only for the time during which your serverless functions execute. This means you pay only for the processing power actually used, saving money while your application is still growing.

However, once the activity in your application begins to grow, your costs can increase very quickly. In an ideal world your costs will increase predictably along with the size of your user base, but in reality there are several scenarios that can result in an unexpectedly high bill at the end of the month.

A misconfigured Lambda function, for example, can end up using a processor that is much more powerful – and more costly – than your function actually needs. Furthermore, a denial-of-service attack can quickly cause your serverless compute usage to balloon as your attackers stress your back-end. Be sure to incorporate this into your monitoring to protect against sudden unexpected infrastructure bills.
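
A simple form of cost monitoring is to compare today’s spend with a recent baseline and alert when it jumps. The sketch below is one possible check, assuming you can already obtain daily cost figures from your provider’s billing data. The function name, the doubling factor, and the minimum history length are illustrative assumptions.

```python
from statistics import mean


def is_cost_spike(daily_costs, today_cost, factor=2.0, min_days=3):
    """Return True when today's spend exceeds `factor` times the recent average.

    daily_costs: spend for previous days, in any consistent currency unit.
    """
    if len(daily_costs) < min_days:
        return False  # not enough history to form a baseline
    baseline = mean(daily_costs)
    return today_cost > factor * baseline


if __name__ == "__main__":
    history = [10.0, 12.0, 11.0]
    print(is_cost_spike(history, today_cost=40.0))
```

A threshold relative to your own baseline catches both a misconfigured function and a traffic flood, without needing to know in advance what a normal bill looks like.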

Serverless Monitoring Options: Centralized Logging vs. Distributed Tracing

Once your monitoring alerts you to potential issues, your next step is often to find out what, exactly, is going wrong with your code.

Logs are crucial tools in this step. If properly used they can provide you with a ready snapshot of your application’s recent activity. In a traditional web application, these logs provide a dependable look at the sequence of events as they occurred in your application, helping you more quickly track down the events leading up to a failure and identify code that warrants further investigation.

However, tracing through serverless log activity can be complex. Instead of a cohesive set of server calls that hit predictable, always-available hardware, the functionality of your application is split across multiple disparate machines. Each of them has its own separate logging mechanism, which must be investigated separately.

Without pre-work to ensure that you can cohesively trace an execution path through the logs of your application’s serverless function calls, you are often left with multiple views of small chunks of the application’s behavior. Identifying the trouble spots in your application becomes tougher as the logs are no longer colocated by default and are grouped by function instead of the execution path.

To work around this limitation, it’s important to use a distributed tracing system, allowing you to trace through your application’s execution.

A distributed tracing system for your application can be as simple as adding a transaction wrapper that ensures every request shares a traceable ID, implementing a means of aggregating logs from the downstream services, or making use of third-party tools to provide a more coherent view of your application’s execution flow.
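
The simplest of these options, a transaction wrapper that gives every request a traceable ID, can be sketched as a decorator. This is a minimal illustration rather than a full tracing system. The `trace_id` field name, the `traced` decorator, and the two example functions are hypothetical.

```python
import functools
import json
import uuid

TRACE_KEY = "trace_id"  # hypothetical field name carried in every event


def log(trace_id, message, **fields):
    """Print one JSON log line tagged with the shared trace ID."""
    print(json.dumps({TRACE_KEY: trace_id, "message": message, **fields}))


def traced(handler):
    """Reuse the caller's trace ID, or create one if this is the entry point."""
    @functools.wraps(handler)
    def wrapper(event, context=None):
        trace_id = event.get(TRACE_KEY) or uuid.uuid4().hex
        event = {**event, TRACE_KEY: trace_id}
        log(trace_id, "start", function=handler.__name__)
        try:
            result = handler(event, context)
        except Exception as exc:
            log(trace_id, "error", function=handler.__name__, error=repr(exc))
            raise
        log(trace_id, "end", function=handler.__name__)
        return result
    return wrapper


@traced
def charge_card(event, context=None):
    return {"status": "charged", TRACE_KEY: event[TRACE_KEY]}


@traced
def place_order(event, context=None):
    # Pass the trace ID along with the downstream call.
    payment = charge_card({"amount": event["amount"], TRACE_KEY: event[TRACE_KEY]})
    return {"order": "created", "payment": payment, TRACE_KEY: event[TRACE_KEY]}
```

Because every log line carries the same ID, searching your aggregated logs for one `trace_id` value reconstructs the execution path across functions, even though each function writes to its own log.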

The right choice will depend on the implementation of your application, and as such needs to be accounted for during software architecture and design.

AWS Serverless Monitoring Tools

As we explore the available tools, we’ll focus on monitoring AWS Lambda functions, as their ecosystem represents approximately 77% of the serverless function market. Most other serverless function providers offer similar tools with similar functionality.

Amazon CloudWatch

Amazon CloudWatch is a dedicated tool for monitoring the performance characteristics of your application’s AWS-driven resources. CloudWatch aggregates statistics from your AWS resource usage and provides logs, metrics, the capability to automate alerts, and more. With CloudWatch you can see the activity being performed by your serverless functions, monitor resource usage to identify bottlenecks in your application architecture, and set up automated alerts for the riskier portions of your application. CloudWatch will likely be at the core of your Lambda monitoring system, giving you access to logs for AWS Lambda, monitoring memory usage, and reporting on general function health.
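
As an illustration of an automated alert, the sketch below builds the parameters for a CloudWatch alarm on a Lambda function’s `Errors` metric and passes them to boto3’s `put_metric_alarm`. The alarm name, threshold, and period are assumptions chosen for the example. Creating the alarm requires the boto3 package and AWS credentials, and no notification target is configured here, so the alarm would only change state until you add one (for example through `AlarmActions`).

```python
def error_alarm_params(function_name: str, threshold: int = 1,
                       period_s: int = 60) -> dict:
    """Parameters for an alarm that fires when a function reports errors."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_s,            # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": threshold,        # errors per window that trigger the alarm
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no invocations is not a failure
    }


def create_error_alarm(function_name: str) -> None:
    """Create the alarm in the account and region configured for boto3."""
    import boto3  # imported lazily so the parameter builder has no dependencies

    boto3.client("cloudwatch").put_metric_alarm(**error_alarm_params(function_name))
```

Keeping the parameters in a plain function makes the alarm definition easy to review and unit test before anything is created in your account.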

AWS X-Ray

AWS X-Ray is a tool designed to help you more easily analyze and debug distributed applications. One of its key selling points is the ability to trace your application’s requests, giving you the capability to follow the execution path of your application across the many different resources it consumes. It integrates deeply with many AWS services, and when fully implemented can help you identify bottlenecks in your application, troubleshoot erroneous behavior, and monitor excessive resource usage.

When coupled with CloudWatch and other monitoring tools, AWS X-Ray can give you a development environment that begins to approach the fidelity of a traditional web application, giving you the AWS Lambda monitoring you need to feel secure in your serverless application’s function.
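
One lightweight way to connect your own logs to X-Ray traces is to read the trace ID that the Lambda runtime exposes and include it in each log line. The sketch below parses the tracing header, assuming it is available in the `_X_AMZN_TRACE_ID` environment variable and uses the `Key=Value;Key=Value` layout. The function names are hypothetical, and the sample header in the usage example is only an illustration of the format.

```python
import os


def parse_trace_header(header: str) -> dict:
    """Split an X-Ray style tracing header into its key/value fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip empty or malformed segments
            fields[key] = value
    return fields


def current_trace_id(environ=os.environ):
    """Return the root trace ID for this invocation, or None if tracing is off."""
    header = environ.get("_X_AMZN_TRACE_ID", "")
    return parse_trace_header(header).get("Root")


if __name__ == "__main__":
    sample = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
    print(parse_trace_header(sample)["Root"])
```

Logging the root trace ID alongside your own messages lets you jump from a CloudWatch log entry to the matching trace, and back again.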

Read more on AWS X-Ray:

Getting Started with AWS X-Ray

Limitations of Cloud Provider First Party Monitoring Tools

Cloud provider monitoring tools can be very powerful in their own right, but they are not without their limitations:

CloudWatch is an excellent tool for metrics and logs crucial to AWS Lambda monitoring, but these logs are distributed by Amazon’s resource IDs. Getting a full picture of your application’s call paths becomes more challenging as the information you need is often split across the dashboards for multiple different serverless functions.
Native monitoring tools are locked into one ecosystem. You can monitor your Lambda functions and set alarms based on their characteristics, for example, but if your application relies heavily on third-party tools you will miss potentially critical signals as your application runs.
Client-side code monitoring is missing from these reports. If your application has frontend-based monitoring and logging, you’ll need to leverage a third party to incorporate this information into your application’s alerts.

Dedicated Serverless Monitoring Platforms

Dedicated serverless monitoring solutions can address the challenges of first-party cloud provider monitoring tools. One such solution is Lumigo, a serverless monitoring platform that lets you monitor serverless applications effortlessly, with full distributed tracing to help you identify root causes and debug issues quickly.

Lumigo can help you:

Solve cold starts – easily obtain cold start-related metrics for your Lambda functions, including cold start %, average cold duration, and enabled provisioned concurrency. Generate real-time alerts on cold starts, so you’ll know instantly when a function is under-provisioned and can adjust provisioned concurrency.
Find and fix issues in seconds with visual debugging – Lumigo builds a virtual stack trace of all services participating in the transaction. Everything is displayed in a visual map that can be searched and filtered.
Automate distributed tracing – with one click and no manual code changes, Lumigo visualizes your entire environment, including your Lambdas, other AWS services, and every API call and external SaaS service.
Identify and remove performance bottlenecks – see the end-to-end execution duration of each service, and which services run sequentially and in parallel. Lumigo automatically identifies your worst latency offenders and provides a full timeline of Lambda invocations.
Receive serverless-specific smart alerts – using machine learning, Lumigo’s predictive analytics identifies and alerts on issues before they impact application performance or costs.
