Serverless monitoring is a key element in ensuring application reliability and security. A good monitoring system alerts you to errors in your serverless applications before they ever affect your customers, allowing you to issue fixes quickly and keep delivering value to your application’s users.
Monitoring for serverless applications is challenging due to the distributed nature of the serverless architecture. This guide is designed to give you an overview of the challenges faced when setting up serverless monitoring and alerting. We’ll explore what tools are available, and how to create a bulletproof serverless application that your users will love.
This is part of an extensive series of guides about Observability.
In a traditional web application, you have full ownership – and full visibility – into the entire application stack. A call to your backend server will always have roughly the same overhead. Your synchronous web calls will always execute in the same general sequence. You can quickly identify bottlenecks as each element of the stack behaves in a predictable fashion.
In serverless applications, observability is more challenging because your infrastructure is almost entirely ephemeral.
Monitoring and logging remain essential for gauging application health, but in an environment where nearly every request is handled by an external machine, the cost of monitoring your application itself is also likely to be higher as a result.
Learn more in our detailed guide to serverless observability.
Here are a few of the common monitoring challenges that arise in a serverless application.
Most serverless function providers implement a hot-cold architecture: the more frequently a serverless function is called, the more likely its execution environment stays warm and available for future calls. Functions that are called frequently in this manner are referred to as “hot” functions.
When a function is idle for any length of time, though, you run the risk of the serverless provider reclaiming the resources used to make your function available. The next time one of these functions is called, the serverless provider needs to spin up associated resources to complete your application’s request. This is known as a “cold” start.
While an individual cold start doesn’t incur too much overhead – normally on the order of 100 milliseconds – enough cold starts strung together can result in a significant impact to user experience. For example, a low-traffic web page with ten serverless function calls can incur up to a full second of additional wait time for cold starts.
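One practical way to see cold starts in your own traffic is to flag them from inside the function itself. Below is a minimal sketch, assuming a Python Lambda handler; the handler name and log format are illustrative conventions, not requirements.

```python
import json
import time

# Module-level code runs once per execution environment, so this flag
# is True only for the invocation that triggered a cold start.
COLD_START = True

def handler(event, context):
    global COLD_START
    was_cold = COLD_START
    COLD_START = False

    started = time.time()
    # ... your business logic here ...

    # Emit a structured log line that a CloudWatch metric filter can count.
    print(json.dumps({
        "cold_start": was_cold,
        "duration_ms": round((time.time() - started) * 1000, 2),
    }))
    return {"statusCode": 200}
```

Counting the log lines where cold_start is true gives you a rough per-function cold start rate that you can watch over time.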
Read more on Cold Starts:
Depending on the provider you choose, you may have very limited choices when it comes to managing the runtime memory of your serverless functions. This can have unexpected effects on your application’s resource usage.
One example is AWS Lambda. During configuration of a Lambda function, you specify the amount of RAM that should be allocated to your function as it runs.
What is often not clearly stated is that this choice also determines the processing power allocated to your serverless function, with larger RAM requests resulting in more powerful processor allocations. Given that processing power is a factor in determining your resource usage, this can result in increased resource usage in your serverless application – and the higher usage bills that go along with it.
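On Lambda, memory is a single knob that also scales CPU allocation roughly in proportion. A minimal sketch using boto3; the function name is a hypothetical placeholder.

```python
import boto3

lambda_client = boto3.client("lambda")

# Raising MemorySize also raises the CPU share allocated to the function,
# which can make a CPU-bound function faster (and sometimes cheaper per request).
lambda_client.update_function_configuration(
    FunctionName="my-function",  # hypothetical function name
    MemorySize=512,              # in MB; Lambda accepts values from 128 to 10,240
)
```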
Read more on memory usage:
The promise of a serverless architecture is that your functions are only available when they are needed, allowing you to save money on resources by not paying for unnecessary availability.
What happens, though, if your application begins to scale? Many serverless function providers impose a limit on concurrent executions. If your application’s activity causes your functions to exceed this concurrency limit, unpredictable behavior may occur.
Concurrency limitations can manifest as longer execution times (while a request waits for an available machine to execute the function), server errors from the provider, or other failures of execution that can severely impact the user experience. As such, it is important to plan around these concurrency limitations and be aware of when you are approaching thresholds defined by your serverless provider.
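One way to stay ahead of these thresholds is an alert that fires well before you hit the limit. Below is a sketch using boto3 and the built-in AWS/Lambda ConcurrentExecutions metric; the threshold and SNS topic ARN are placeholders you would tune to your own account’s limit.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when concurrency crosses 80% of a hypothetical 1,000-execution limit,
# leaving time to react before throttling begins.
cloudwatch.put_metric_alarm(
    AlarmName="lambda-concurrency-80pct",
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=800,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
)
```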
Read more on concurrency limitations:
In a traditional web application, your resource availability is easily discoverable and often well-known by your application maintainers.
In a serverless architecture it is more difficult to identify these limitations. Given that serverless applications rely upon on-demand architecture, you can often run into cases where a function simply fails to respond. This can be due to a temporary issue on the provider, a bug in your code that is causing silent failures, or any of a number of potential reasons in-between.
To protect against non-responding resources, you’ll need to do more than practice defensive coding to ensure graceful degradation of the user experience – you will also need additional monitoring to catch these scenarios when they happen. Monitoring for these characteristics will help you identify patterns in your application’s behavior, allowing you to potentially predict failures before they happen (as well as respond more quickly when they do occur).
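Defensive coding here usually means a timeout plus a bounded retry with backoff, so one unresponsive function doesn’t stall the whole request. A minimal, generic sketch; the call_function argument is a stand-in for whatever downstream invocation your application makes.

```python
import time

def call_with_retries(call_function, attempts=3, base_delay=0.2):
    """Invoke a flaky downstream call with exponential backoff.

    Re-raises the last exception if all attempts fail, so the caller
    can degrade gracefully instead of hanging indefinitely.
    """
    for attempt in range(attempts):
        try:
            return call_function()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Back off: 0.2s, 0.4s, 0.8s, ...
            time.sleep(base_delay * (2 ** attempt))
```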
Read more on resource availability limitations:
Generally, serverless functions are only charged for the time during which they execute. This means you pay only for the processing power actually used, saving money when your application is still growing.
However, once the activity in your application begins to grow, your costs can increase very quickly. In an ideal world your costs will increase predictably along with the size of your user base, but in reality there are several scenarios that can result in an unexpectedly high bill at the end of the month.
A misconfigured Lambda function, for example, can end up using a processor that is much more powerful – and more costly – than your function actually needs. Furthermore, a denial-of-service attack can quickly cause your serverless compute usage to balloon as your attackers stress your back-end. Be sure to incorporate this into your monitoring to protect against sudden unexpected infrastructure bills.
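Because Lambda bills on GB-seconds (memory × duration) plus a per-request fee, a back-of-the-envelope estimate is easy to sanity-check against your bill. The sketch below assumes the published us-east-1 x86 prices at the time of writing and ignores the free tier; check current pricing for your region.

```python
def estimate_monthly_cost(invocations, avg_duration_ms, memory_mb,
                          gb_second_price=0.0000166667,      # us-east-1 x86; subject to change
                          request_price=0.20 / 1_000_000):   # $0.20 per million requests
    """Rough monthly Lambda compute + request cost, before free tier."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * gb_second_price + invocations * request_price

# 10M invocations/month at 200 ms average on a 512 MB function:
# 1,000,000 GB-seconds -> ~$16.67 compute, plus $2.00 in request fees.
print(f"${estimate_monthly_cost(10_000_000, 200, 512):.2f}")
```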
Read more on serverless function costs:
Once your monitoring alerts you to potential issues, your next step is often finding out what, exactly, is going wrong with your code.
Logs are crucial tools in this step. If properly used they can provide you with a ready snapshot of your application’s recent activity. In a traditional web application, these logs provide a dependable look at the sequence of events as they occurred in your application, helping you more quickly track down the events leading up to a failure and identify code that warrants further investigation.
However, tracing through serverless log activity can be complex. Instead of a cohesive set of server calls that hit predictable, always-available hardware, the functionality of your application is split across multiple disparate machines, each with its own logging mechanism that must be investigated individually.
Without pre-work to ensure that you can cohesively trace an execution path through the logs of your application’s serverless function calls, you are often left with multiple views of small chunks of the application’s behavior. Identifying the trouble spots in your application becomes tougher as the logs are no longer colocated by default and are grouped by function instead of the execution path.
To work around this limitation, it’s important to use a distributed tracing system, allowing you to trace through your application’s execution.
A distributed tracing system for your application can be as simple as adding a transaction wrapper that ensures every request shares a traceable ID, implementing a means of aggregating logs from the downstream services, or making use of third-party tools to provide a more coherent view of your application’s execution flow.
The right choice will depend on the implementation of your application, and as such needs to be accounted for during software architecture and design.
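As a minimal sketch of the first option above, a wrapper can generate or propagate a shared trace ID and stamp it on every log line. The x-trace-id header name and the event shape are illustrative assumptions, not a standard.

```python
import functools
import json
import uuid

def with_trace_id(handler):
    """Ensure every invocation logs under a single, propagatable trace ID."""
    @functools.wraps(handler)
    def wrapper(event, context):
        # Reuse an ID passed by the caller, or start a new trace here.
        headers = event.get("headers") or {}
        trace_id = headers.get("x-trace-id") or str(uuid.uuid4())

        def log(message, **fields):
            print(json.dumps({"trace_id": trace_id, "msg": message, **fields}))

        return handler(event, context, trace_id=trace_id, log=log)
    return wrapper

@with_trace_id
def handler(event, context, trace_id, log):
    log("order received", order_id=event.get("order_id"))
    # Pass trace_id along (e.g. as an x-trace-id header) on downstream calls
    # so every function in the chain logs under the same ID.
    return {"statusCode": 200}
```

With every log line carrying the same ID, searching your aggregated logs for a single trace_id reconstructs the full execution path.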
Read more on Centralized Logging and Distributed Tracing:
As we explore the available tools, we’ll focus on monitoring AWS Lambda functions, as the AWS ecosystem represents approximately 77% of the serverless function market; most serverless function providers offer similar tools with similar functionality.
Amazon CloudWatch is a dedicated tool for monitoring the performance characteristics of your application’s AWS-driven resources. CloudWatch aggregates statistics from your AWS resource usage and provides logs, metrics, automated alerting, and more. Through CloudWatch you can see the activity performed by your serverless functions, monitor resource usage to identify bottlenecks in your application architecture, and set up automated alerts for the riskier portions of your application. CloudWatch will likely sit at the core of your Lambda monitoring system, giving you access to logs for AWS Lambda, monitoring memory usage, and reporting on general function health.
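As a small taste of what this looks like in practice, the sketch below pulls the last hour of error counts for a single function via boto3; the function name is a placeholder.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Sum of Lambda errors for one function over the last hour, in 5-minute buckets.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],  # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```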
Read more on CloudWatch:
AWS X-Ray is a tool designed to help you more easily analyze and debug distributed applications. One of its key selling points is the ability to trace your application’s requests, giving you the capability to follow the execution path of your application across the many different resources it consumes. It integrates deeply with many AWS services, and when fully implemented can help you identify bottlenecks in your application, troubleshoot erroneous behavior, and monitor excessive resource usage.
When coupled with CloudWatch and other monitoring tools, AWS X-Ray can give you an environment that begins to approach the observability of a traditional web application, with the AWS Lambda monitoring you need to feel confident in how your serverless application is functioning.
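Getting started with X-Ray from Python code is mostly a matter of installing the aws-xray-sdk package and enabling active tracing on the function. A minimal sketch; the subsegment name is illustrative.

```python
# Requires the aws-xray-sdk package and active tracing enabled on the function.
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch supported libraries (boto3, requests, ...) so their downstream
# calls show up automatically as subsegments in the trace.
patch_all()

@xray_recorder.capture("process_order")  # illustrative subsegment name
def process_order(order):
    # Work done here is timed and attached to the current trace.
    ...

def handler(event, context):
    process_order(event)
    return {"statusCode": 200}
```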
Read more on AWS X-Ray:
Cloud provider monitoring tools can be very powerful in their own right, but they are not without their limitations.
Dedicated serverless monitoring solutions can address the challenges of first-party cloud provider monitoring tools. One such solution is Lumigo, a serverless monitoring platform that lets you monitor serverless applications effortlessly, with full distributed tracing to help you identify root causes and debug issues quickly.