Monitoring is a key element in ensuring application reliability and security. A good monitoring system alerts you to errors before they ever affect your customers, allowing you to quickly issue fixes and maintain a high level of value delivery for your application’s users. In a serverless context, monitoring becomes significantly more challenging due to the disparate nature of a serverless application’s architecture. This guide is designed to give you an overview of the challenges faced when setting up serverless monitoring and alerting. We’ll explore what tools are available, where their limitations lie, and how to work around these shortcomings to create a robust serverless application that your users will love.
For an application company, your product’s trustworthiness depends heavily on the quality of experience delivered to your users.
For a traditional web application this includes metrics like API response time, page load times, exceptions encountered and surfaced, and any of the other hundreds of warning signs you watch for when developing your product. Many of these metrics can be automated, and by establishing thresholds around these metrics you can create an early-warning system that gives you a heads-up when something is about to go wrong.
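As a rough illustration of such an early-warning system, the sketch below compares sampled metric values against a warning level and an alert level. The metric names and threshold values are hypothetical and stand in for whatever your application actually measures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    warn_at: float   # value at which to raise an early warning
    alert_at: float  # value at which to notify someone immediately


def evaluate(samples, thresholds):
    """Return 'ok', 'warn', or 'alert' for every metric that has a sample."""
    statuses = {}
    for t in thresholds:
        value = samples.get(t.metric)
        if value is None:
            continue  # no sample collected for this metric
        if value >= t.alert_at:
            statuses[t.metric] = "alert"
        elif value >= t.warn_at:
            statuses[t.metric] = "warn"
        else:
            statuses[t.metric] = "ok"
    return statuses


# Hypothetical thresholds, in milliseconds.
THRESHOLDS = [
    Threshold("api_response_ms", warn_at=300, alert_at=800),
    Threshold("page_load_ms", warn_at=2000, alert_at=4000),
]
```

Calling `evaluate` on each polling interval and routing any "warn" or "alert" status to a notification channel is one simple way to get the heads-up described above.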
These signals give you greater confidence in your product, since you can rely on the early warning signs they provide to tell you when something is going wrong. Polling these metrics at a regular interval is the core of monitoring for web applications, giving you a real-time view of application performance and letting you respond to events like increased load times before your users have a chance to complain.
The true benefit of monitoring, though, comes when something does go wrong. Without monitoring, these errors manifest as dropped HTTP calls, which can translate into confusing and frustrating behavior for your users.
When applied properly, a well-thought-out monitoring system can pick up application exceptions and errors as they are reported, routing them to the correct channel so that they can be handled. You’ll be able to more quickly identify when things go wrong, and reduce the time required to identify and fix the issues in your application.
In short, in a web application monitoring can be the key difference between
“This application just began behaving weirdly”
and “The application hasn’t been functioning for twelve hours.”
In a traditional web application, your control over the application stack is fairly complete. You can easily set up monitoring for a full-stack request that includes time to execute all relevant server-side calls, full page render times, and log analysis.
As you have full ownership – and full visibility – into the entire application stack, coordinating request actions is relatively straightforward as each call is fairly deterministic in its action. A call to your backend server will always have roughly the same overhead. Your synchronous web calls will always execute in the same general sequence. As you have visibility into every facet of your application’s execution, you can quickly identify bottlenecks as each element of the stack will behave more or less in a predictable fashion.
Serverless applications throw a wrench into the works. Where in a traditional application you have dedicated resources that are always available, in a serverless application your infrastructure is almost entirely ephemeral.
While your main content servers may remain static, the serverless functions containing your application’s back end will be re-instantiated multiple times during your application’s execution run.
The stateless nature of serverless functions also introduces challenges, as you no longer maintain your application in terms of discrete multi-event transactions. Timing becomes unpredictable as well, as you incur additional overhead for each call to a function that has been idle for any length of time.
Finally, as you transition from paying for resource availability to only paying for the resources your application uses, it becomes more challenging to determine the exact quantity of each resource used as your application operates – making your infrastructure costs less predictable in the process.
One other item to keep in mind with regard to serverless application monitoring is that your functions operate entirely independently. Traditional monitoring tools, as a result, tend to have higher costs in a serverless application, due to the distributed nature of the architecture. This can result in issues like incomplete tracing for exceptions and additional performance hits for remote metric tracking systems.
While monitoring and logging are extremely important for gauging application health, in an environment where every request will likely go to an external machine, it is important to note that the simple cost of monitoring your application is likely to be higher as a result.
There are several problems faced by serverless applications that are not present in a more traditional client-server application. The on-demand nature of the machines that drive your application’s serverless functionality exacerbates these problems, making them harder to solve than they would be in a single-stack architecture.
These potential issues can vary in severity and impact, from minor increases in execution time all the way up to significant increases in resource costs. Below we’ll cover a few of the common pitfalls that arise in a serverless application, and how they can manifest.
Serverless functions are often an excellent choice when your application doesn’t need the constant availability provided by dedicated hardware. After all, if you’re only using one hour of execution time each day, why should you pay for the other 23?
In order to achieve this, most serverless function providers implement a hot-cold architecture. Basically, the more frequently a serverless function is called, the more available it will be for future calls. Functions that are called frequently in this manner are referred to as “hot” functions.
When a function is idle for any length of time, though, you run the risk of the serverless provider reclaiming the resources used to make your function available. The next time one of these functions is called, the serverless provider needs to spin up associated resources to complete your application’s request. This is known as a “cold” start.
While an individual cold start doesn’t incur too much overhead – normally on the order of 100 milliseconds – enough cold starts strung together can result in a significant impact to user experience. For example, a low-traffic web page with ten serverless function calls can incur up to a full second of additional wait time for cold starts.
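One common way to make cold starts visible is a module-level flag: code at module scope runs once per new execution environment, so the first invocation in that environment can be tagged as cold. The following is a minimal sketch of that idea, assuming a Python Lambda-style handler; the log field names are illustrative.

```python
import json
import time

# Module scope runs once per new execution environment, so this flag is
# True only for the first invocation handled by that environment.
_cold_start = True


def handler(event, context):
    global _cold_start
    was_cold = _cold_start
    _cold_start = False

    started = time.perf_counter()
    result = {"ok": True}  # placeholder for the function's real work
    duration_ms = (time.perf_counter() - started) * 1000

    # Emit one structured log line per invocation so cold starts can be counted.
    print(json.dumps({"cold_start": was_cold, "duration_ms": round(duration_ms, 2)}))
    return {"cold_start": was_cold, "result": result}
```

Counting the log lines where `cold_start` is true gives you a rough cold-start rate per function, which you can then compare against the latency your users see.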
Memory usage can also be challenging to monitor in serverless applications. Depending on the provider you choose, you may have very limited choices when it comes to managing the run-time memory of your serverless functions. This can have unexpected effects in your application’s resource usage.
One example is with AWS Lambda functions. During configuration of a Lambda function, you often specify the amount of RAM that should be allocated to your function as it runs.
What is often not clearly stated is that this choice can also determine the processing power allocated to your serverless function, with larger RAM requests resulting in more powerful processor allocations. Given that processing power is a factor in determining your resource usage, this can result in increased resource usage in your serverless application, and the higher usage bills that go along with it.
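To see how the memory setting feeds into the bill, here is a back-of-the-envelope sketch. It assumes the usual Lambda billing model of duration multiplied by configured memory (GB-seconds) plus a per-request charge. The prices are deliberately left as parameters, because current rates vary by region and architecture and should be taken from the provider's pricing page.

```python
def estimate_monthly_cost(
    invocations,
    avg_duration_ms,
    memory_mb,
    price_per_gb_second,
    price_per_million_requests,
):
    """Rough monthly cost: compute (GB-seconds) plus request charges."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    request_cost = (invocations / 1_000_000) * price_per_million_requests
    return compute_cost + request_cost
```

Note that doubling `memory_mb` doubles the compute portion only if the average duration stays the same. Because more memory also brings more processing power, the duration often drops, so it is worth measuring both settings rather than assuming either one is cheaper.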
The promise of a serverless architecture is that your functions are only available when they are needed, allowing you to save money on resources by not paying for unnecessary availability.
What happens, though, if your application begins to scale? Many serverless function providers include a concurrency limit in execution. If your application’s activity causes your functions to exceed this concurrency limit, then unpredictable behavior may occur.
Concurrency limitations can manifest as longer execution times (while a request waits for an available machine to execute the function), server errors from the provider, or other failures of execution that can severely impact user experience. As such, it is important to plan around these concurrency limitations and be aware of when you are approaching thresholds defined by your serverless provider.
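A simple way to reason about how close you are to a limit is to estimate concurrency as the request rate multiplied by the average execution time. The sketch below applies that estimate; the limit and the 80% warning ratio are hypothetical values that you would replace with your provider's actual quota and your own tolerance.

```python
def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """Approximate number of function instances busy at the same time."""
    return requests_per_second * avg_duration_seconds


def concurrency_status(requests_per_second, avg_duration_seconds,
                       concurrency_limit, warn_ratio=0.8):
    """Classify load as 'ok', 'warn', or 'over_limit' relative to the limit."""
    needed = estimated_concurrency(requests_per_second, avg_duration_seconds)
    utilization = needed / concurrency_limit
    if utilization >= 1.0:
        return "over_limit"
    if utilization >= warn_ratio:
        return "warn"
    return "ok"
```

For example, 200 requests per second at an average of 0.5 seconds each needs roughly 100 concurrent executions, which leaves plenty of headroom under a hypothetical limit of 1,000.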
In a traditional web application, your resource availability is easily discoverable and often well-known by your application maintainers. Owning the entire tech stack gives you an always-available architecture, as well as full control over the resources used by your application.
Finding these limitations becomes more complicated when dealing with a serverless architecture. Given that serverless applications rely upon on-demand architecture, you can often run into cases where a function simply fails to respond. This can be due to a temporary issue on the provider, a bug in your code that is causing silent failures, or any of a number of potential reasons in-between.
Protecting against non-responding resources not only requires defensive coding on your part to ensure graceful degradation of the user experience, but additional monitoring to catch these scenarios when they happen. Monitoring characteristics like this will help you identify patterns in your application’s behavior, allowing you to potentially predict failures before they happen (as well as respond more quickly when they do occur).
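The defensive side of this can be sketched as a wrapper that gives a call a fixed time budget, falls back to a default value when the call hangs or raises, and records the failure so monitoring can pick it up. This is a minimal illustration, not a complete resilience strategy; the helper name and the `failures` list are hypothetical.

```python
import concurrent.futures


def call_with_fallback(fn, timeout_s, fallback, failures):
    """Run fn with a time budget; on timeout or error, record it and degrade."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        failures.append("timeout")  # the call did not respond in time
        return fallback
    except Exception as exc:
        failures.append(type(exc).__name__)  # the call failed outright
        return fallback
    finally:
        # Do not block the caller waiting for a hung worker thread.
        executor.shutdown(wait=False)
```

In a real application the `failures` list would be replaced by a metric or a structured log entry, so that repeated timeouts show up as a pattern rather than as isolated user complaints.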
Another item to keep an eye on is the cost of execution. Generally, serverless functions are only charged for the time during which they execute. Paying only for the processing power used helps you save money when your application is still growing, letting you focus on R&D and functionality instead of maintaining always-available infrastructure.
However, once activity in your application begins to grow, your costs can increase very quickly. While in an ideal world your costs will increase predictably along with the size of your user base, there are some scenarios to watch for that can lead to a hefty AWS bill at the end of the month.
A misconfigured Lambda function, for example, can end up using a processor that is much more powerful – and more costly – than your function actually needs. Furthermore, a denial-of-service attack can quickly cause your serverless compute usage to balloon as your attackers stress your back-end. Be sure to incorporate this into your monitoring to protect against sudden unexpected infrastructure bills.
Once your monitoring alerts you to potential issues, often your next step is finding out what, exactly, is going wrong with your code.
Logs are crucial tools in this step. If properly used they can provide you with a ready snapshot of your application’s recent activity. In a traditional web application, these logs provide a dependable look at the sequence of events as they occurred in your application, helping you more quickly track down the events leading up to a failure and identify code that warrants further investigation.
Tracing through log activity becomes more complicated in a serverless context. Instead of a cohesive set of server calls that hit predictable, always-available hardware, the functionality of your application is split across multiple disparate machines. These each have their own separate logging mechanism, each of which must be investigated.
Without pre-work to ensure that you can cohesively trace an execution path through the logs of your application’s serverless function calls, you are often left with multiple views of small chunks of the application’s behavior. Identifying the trouble spots in your application becomes tougher, since the logs are no longer colocated by default and are grouped by function instead of execution path.
To work around this limitation, it’s important to spend development effort on creating a comprehensive distributed tracing system, allowing you to trace through your application’s execution. A distributed tracing system for your application can be as simple as adding a transaction wrapper that ensures every request shares a traceable ID, or implementing a means of aggregating logs from the different resources that govern your application’s serverless behavior, or even making use of third-party tools to provide a more coherent view of your application’s execution flow.
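The simplest of these options, a transaction wrapper that makes every request share a traceable ID, might look like the sketch below. The decorator, the `trace_id` field, and the two example functions are all hypothetical; the point is only that each function reuses an incoming ID, or creates one if none exists, and stamps it on every log line.

```python
import functools
import json
import uuid


def log(trace_id, function_name, message):
    # One JSON object per line keeps the logs easy to aggregate later.
    print(json.dumps({"trace_id": trace_id, "function": function_name,
                      "message": message}))


def traced(handler):
    """Reuse the caller's trace ID, or create one at the start of a request."""
    @functools.wraps(handler)
    def wrapper(event, context=None):
        trace_id = event.get("trace_id") or str(uuid.uuid4())
        traced_event = {**event, "trace_id": trace_id}
        log(trace_id, handler.__name__, "start")
        try:
            result = handler(traced_event, context)
        except Exception as exc:
            log(trace_id, handler.__name__, f"error: {exc}")
            raise
        log(trace_id, handler.__name__, "end")
        return {**result, "trace_id": trace_id}
    return wrapper


@traced
def create_order(event, context=None):
    # Forward the trace ID so the downstream function logs under the same ID.
    return charge_card({"trace_id": event["trace_id"], "amount": event["amount"]})


@traced
def charge_card(event, context=None):
    return {"charged": event["amount"]}
```

Here the two functions call each other directly for brevity. In a deployed application the ID would travel in the invocation payload or in a request header instead.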
The right choice will depend on the implementation of your application, and as such needs to be accounted for during software architecture and design.
Monitoring functions in a serverless application can be challenging. Part of this is due to the newness of the field – what was trivial on a dedicated web server can become extremely challenging when that same functionality is split across dozens of ephemeral function instances.
Serverless function providers recognize this, fortunately, and offer some tools to help you create a picture of your application’s behavior. As we explore tools available, we’ll focus on monitoring AWS Lambda functions as their ecosystem represents approximately 77% of the serverless function market, but most serverless function providers offer similar tools with similar functionality.
Amazon CloudWatch is a dedicated tool for monitoring the performance characteristics of your application’s AWS-driven resources. CloudWatch aggregates statistics from your AWS resource usage, and provides logs, metrics, the capability to automate alerts, and more. Through use of CloudWatch you can see the activity being performed by your serverless functions, monitor resource usage to identify bottlenecks in your application architecture, and set up automated alerts for the riskier portions of your application. CloudWatch will likely be at the core of your Lambda monitoring system, giving you access to logs for AWS Lambda, monitoring memory usage, and reporting on general function health.
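As an example of an automated alert, the sketch below builds the parameters for a CloudWatch alarm on a Lambda function's `Errors` metric. It only constructs the request, so it runs without AWS credentials; the function name, topic ARN, and threshold are hypothetical, and the commented lines show how the parameters might be passed to `put_metric_alarm` if you use boto3.

```python
def error_alarm_params(function_name, sns_topic_arn, threshold=1, period_seconds=60):
    """Build put_metric_alarm parameters for a Lambda error alarm."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Quiet periods with no invocations should not trigger the alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# With boto3 installed and credentials configured, the alarm could be created with:
#   import boto3
#   params = error_alarm_params("checkout-handler", "<your SNS topic ARN>")
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```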
AWS X-Ray is a tool designed to help you more easily analyze and debug distributed applications. One of its key selling points is the ability to offer tracing for your application’s request, giving you the capability to follow the execution path of your application across the many different resources it consumes. It integrates deeply with many AWS services, and when fully implemented can help you identify bottlenecks in your application, troubleshoot erroneous behavior, and monitor excessive resource usage.
When coupled with CloudWatch and other monitoring tools, AWS X-Ray can give you a development environment that begins to approach the fidelity of a traditional web application, giving you the AWS Lambda monitoring you need to feel secure in your serverless application’s function.
Native monitoring tools can be very powerful in their own right, but they are not without their limitations.
For example, CloudWatch is an excellent tool for metrics and logs crucial to AWS Lambda monitoring, but these logs are distributed by Amazon’s resource IDs. Getting a full picture of your application’s call paths becomes more challenging, since often the information you need is split across the dashboards for multiple different serverless functions.
Another issue with native monitoring tools is the fact that they are locked into one ecosystem. You can monitor your Lambda functions and set alarms based on their characteristics, for example, but if your application relies heavily on third-party tools you will miss potentially critical signals as your application runs.
Furthermore, your client-side code’s monitoring will also be absent from these reports. If your application has frontend-based monitoring and logging, you’ll need to leverage a third party to incorporate this information into your application’s alerts.
Given how fast web development moves, oftentimes open source developers are able to deliver solutions to serverless architecture frustrations more quickly than the native providers are able to respond.
These products can vary from tools that make monitoring more developer-friendly all the way up to bleeding-edge serverless framework techniques. Below we’ll look at a few common open-source serverless monitoring tools, and see what they have to offer.
OpenTracing is a vendor-neutral open standard that works to define the means to implement a distributed execution tracing system.
By implementing an OpenTracing-compliant tracing system, you can gain a more complete view of your application’s behavior across the disparate resources it touches.
OpenTracing is a standard, as opposed to a product, meaning that you’ll need to do some implementation work in your application’s code to establish a tracing system that matches OpenTracing’s proposed standard. However, they do offer libraries in a number of languages that make this process straightforward and manageable.
One of the challenges of working with native tools like CloudWatch is that they often rely heavily upon internal patterns to represent their data. While this makes sense inside their own ecosystem, when trying to track down an issue these structures can introduce unnecessary clicks and user interface patterns that can greatly slow down the discovery process.
The CloudWatch Logs Shipper works to automate some of this grunt work, giving developers the capability to extract logs from CloudWatch and into a common back-end. Created by Yan Cui, Developer Advocate at Lumigo, this powerful tool gives developers the ability to analyze a serverless application’s entire log set without having to perform arcane configurations within the native provider’s tools. It can greatly reduce the debugging effort required to track down and identify issues in your serverless application. CloudWatch Logs Shipper will help level up your AWS Lambda monitoring, giving you better visibility into how your serverless functions interact.
Zipkin.io is an open-source distributed tracing system designed to integrate with your serverless functions. It includes a user interface for viewing your application’s distributed transactions, configurable storage for the logs generated, and a powerful query language.
Integrating Zipkin.io with your project is very flexible, with support for HTTP integration, Kafka, and many other tool chains. The entire project code, including the Zipkin server, is available on their GitHub repo, allowing you to view your application’s behavior in a powerful user interface running on your own hardware.
While the open-source community has a lot to offer for serverless monitoring, there are a few limitations to be concerned with when implementing these tools.
All of these tools require potential integration with a third party. Integrations like these add processing time to an already distributed application.
Features such as application tracing require adding code to all of your serverless functions manually. Any missing functions will simply not appear in the end result.
Some open-source communities can be volatile, occasionally introducing security concerns, or changing the root format of the underlying messaging.
Open-source tools are third-party tools, meaning they will be missing potentially crucial pieces of information that are only available on a native provider’s platform.
Open-source tools are unable to do things with AWS Lambda like monitor memory usage, unless you add code to explicitly report these metrics in your AWS Lambda monitoring system.
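As an illustration of the last point, explicitly reporting a metric can be as small as writing a structured log line. The sketch below formats a memory reading using the CloudWatch Embedded Metric Format, in which a JSON log line carrying an `_aws` block is turned into a metric. The namespace and metric name are hypothetical, and `peak_memory_mb` assumes a Linux runtime, where `ru_maxrss` is reported in kilobytes.

```python
import json
import time


def memory_metric_line(function_name, memory_used_mb, namespace="MyApp/Lambda"):
    """Format one memory reading as an Embedded Metric Format log line."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["FunctionName"]],
                "Metrics": [{"Name": "MemoryUsedMB", "Unit": "Megabytes"}],
            }],
        },
        # Dimension and metric values live at the top level of the document.
        "FunctionName": function_name,
        "MemoryUsedMB": memory_used_mb,
    })


def peak_memory_mb():
    """Peak resident memory of this process (Linux reports kilobytes)."""
    import resource
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
```

Inside a handler, `print(memory_metric_line("checkout-handler", peak_memory_mb()))` at the end of each invocation would be enough to emit the reading.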
While the pitfalls with serverless monitoring can be troublesome, they can be worked around in a number of ways. These can involve third-party tools, additional native monitoring tools, and different program architectures.
Below we’ll look at a few best practices as they apply to serverless applications, and see how they can help you build a more cohesive picture of your application’s behavior.
One of the most crucial parts of monitoring your serverless application is understanding how the parts tie together. You’ll likely begin with your code, which you’ll need to follow step by step in order to track down bugs and bad behavior, something that distributed tracing will make much easier.
Making use of native tools like AWS X-Ray will get you very far towards monitoring AWS Lambda functions, but as it is limited to AWS resources you’ll still encounter gaps in your distributed tracing coverage. To work around this, leverage an open standard like OpenTracing to create a cohesive view of the flow of logic in your application.
Coupled with copious logging and log aggregation, this can get you most of the way to the debugging fidelity of a traditional web application without significant impact to your codebase.
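To show what log aggregation buys you, the sketch below regroups log lines from several functions by execution path instead of by function. It assumes each function writes one JSON object per line with `trace_id` and `timestamp` fields; both field names are hypothetical.

```python
import json
from collections import defaultdict


def group_by_trace(log_lines):
    """Regroup JSON log lines by trace ID, ordered by timestamp within a trace."""
    traces = defaultdict(list)
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured logs
        if not isinstance(record, dict) or "trace_id" not in record:
            continue
        traces[record["trace_id"]].append(record)
    for records in traces.values():
        records.sort(key=lambda r: r.get("timestamp", 0))
    return dict(traces)
```

Feeding this the combined output of every function involved in a request yields one ordered timeline per request, which is much closer to the single log file you would read in a traditional web application.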
Native monitoring will provide you with an early warning system for the components that comprise your serverless application. By leveraging the tools your serverless function provider offers, you will be able to tie directly into their infrastructure and identify issues as they arise in real-time.
However, this will only give you a partial view of your application’s health. Supplement your serverless monitoring with third-party tools to get a more complete picture.
Oftentimes the user interfaces for native monitoring tools in a serverless provider’s ecosystem are complex due to the nature of the product. Things like CloudWatch logs are spread across multiple different AWS resources, for example, creating a nightmare when trying to build a distributed tracing system using native tools.
Additionally, errors lose a lot of context when filtered through native monitoring tools as your application runs. By centralizing this information with a third party, you can build a better picture of the current state of your application.
Make use of log aggregators like Logz.io to cut through the frustration of provider log UIs – this will also give you analysis tools that can help you build informative dashboards on your system’s behavior.
Coupled with an exception aggregator like Raygun, you can catch most of the edge cases that lead to bad behavior in your application, preventing issues before they have significant impact for your users.
While serverless applications can give you massive benefits in terms of application availability and scalability, these benefits can come at a significant cost of ease of maintenance.
With a system that is distributed by default, standard defensive development practices that worked fine in a traditional web application will require additional thought and consideration.
Given the importance of monitoring an application’s health, this means that you’ll want to dedicate significant effort to ensuring that your application is running smoothly. With the appropriate mix of third-party and native tools, you can get monitoring fidelity rivalling that of a traditional web application.