The Hidden Costs of Serverless Observability


The rise of serverless architectures has created a growing need for solutions to the challenges of microservice observability, one of the most critical components of running high-performing, secure, and resilient serverless applications.

Observability solutions have to cut through the complexity of serverless systems. With the right stack, observability not only enables fast and easy debugging of applications but also drives optimization and cost efficiency. With the wrong tool set, however, the cost of observability can rise unexpectedly, especially as applications grow in complexity and scale.

In this article, we will:

  • Explore the challenges that serverless architectures present to observability
  • Suggest possible solutions
  • Identify the hidden costs of these solutions
  • Discuss ways to mitigate such expenses

Why Is Observability a Challenge in Serverless Architectures?

When you build and test locally, you have complete control. Even well-known container orchestrators like Kubernetes allow you to set up and control clusters locally. This keeps observability relatively straightforward.

In serverless environments, not only are you abstracted from the underlying infrastructure that runs your code, but the distributed, event-driven nature of serverless applications makes traditional approaches to monitoring and debugging ineffective. Serverless functions are often chained through asynchronous event sources such as AWS SNS, SQS, Kinesis, or DynamoDB, making tracing extremely difficult.

A serverless approach structures an application in the way that makes the most sense for the underlying services from which it is built. This, in turn, makes monitoring much harder, since there isn’t one central, monolithic application server anymore. Instead, you are faced with a distributed architecture and an abundance of competing tools to conduct serverless monitoring.

Getting to grips with all these different services can be a cumbersome task. Take AWS as an example. AWS offers various services that are concerned with monitoring specific aspects of an application, but also provides services, like AWS Lambda, that can be scattered over the whole architecture.

Then there are managed services like DynamoDB, SNS, and SQS, which are black boxes. They aren’t built or owned by the developer, which makes monitoring serverless applications from end to end a critical challenge.

Serverless Monitoring Tools

If you build to support your familiar local monitoring tools that aren’t optimized for serverless, you’ll end up trying to replicate a monolithic architecture, which comes with a much greater long-term expense.

As a simplified example, AWS Lambda bills per “gigabyte-second” (time to run × memory used). A monolithic Lambda in Node.js will not only contain more functions, but will also need to load more NPM packages to support their combined needs. If running a single function in that monolith uses double the memory and takes twice as long as running the isolated function in a distributed architecture, you’re spending four times more than necessary on compute resources simply to support your monitoring tool.
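To see how that plays out in dollars, here’s a back-of-the-envelope sketch of the comparison. The memory sizes, durations, and the per-GB-second price are illustrative assumptions, not a quote of current AWS pricing:

```typescript
// Back-of-the-envelope Lambda cost comparison. All numbers are illustrative
// assumptions, not current AWS pricing.
const PRICE_PER_GB_SECOND = 0.0000166667; // assumed price per GB-second

function lambdaCost(memoryMb: number, durationMs: number, invocations: number): number {
  const gbSeconds = (memoryMb / 1024) * (durationMs / 1000) * invocations;
  return gbSeconds * PRICE_PER_GB_SECOND;
}

// Hypothetical monolithic Lambda loading many packages: 1024 MB, 400 ms per call.
const monolith = lambdaCost(1024, 400, 1_000_000);

// The same work as an isolated, single-purpose function: 512 MB, 200 ms per call.
const isolated = lambdaCost(512, 200, 1_000_000);

console.log(`Monolith: $${monolith.toFixed(2)} per million invocations`);
console.log(`Isolated: $${isolated.toFixed(2)} per million invocations`);
console.log(`Ratio:    ${(monolith / isolated).toFixed(1)}x`); // ~4.0x
```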

Serverless observability tools are purpose-built to support serverless architectures. You can reap all the benefits of serverless technology without sacrificing insight into the system, meaning that you retain opportunities to catch errors and optimize for cost and responsiveness.

What Does It Take to Build Observability for Serverless Systems?

Building an observability solution tailored to a serverless system requires data, and a lot of it. Logs, metrics, and traces are all required so the system can answer as many questions as possible about how it’s working, whether you’re repairing or optimizing. Each source contributes a different perspective on how your application is performing, and, analyzed together, they give a complete picture of its health.

Metrics

Metrics give you an overview of a service’s health and answer essential questions about your application’s real-time performance, such as:

  • Can it keep up with the traffic?
  • Does it need more resources to reduce latency?
  • Can it cope with fewer resources to perform the same function?
  • Do some types of errors crop up repeatedly?

Traces

Traces follow events as they flow through your architecture from end to end. In a serverless system, services work together to complete a workload. When an error occurs, finding out which microservice caused it is crucial. The service that throws an error is not always the one that caused it, but if you know which services were triggered before or after it in a transaction and can tie those specific executions to their log entries, the source of the error becomes more apparent.
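As an illustration of how that attribution works, here is a minimal sketch using the OpenTelemetry API to wrap a unit of work in a span; the tracer name, span name, and downstream calls are hypothetical:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service"); // arbitrary tracer name

// Hypothetical downstream calls; substitute your real integrations.
async function chargePayment(orderId: string): Promise<void> { /* ... */ }
async function publishEvent(orderId: string): Promise<void> { /* ... */ }

export async function handleOrder(orderId: string): Promise<void> {
  // startActiveSpan makes this span the parent of any spans created inside it,
  // so downstream work shows up as children in the same trace.
  await tracer.startActiveSpan("handle-order", async (span) => {
    try {
      span.setAttribute("order.id", orderId);
      await chargePayment(orderId);
      await publishEvent(orderId);
    } catch (err) {
      // Record which step failed so the trace points at the real culprit.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```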

Logs

Finally, logs record in-depth information from the services active in your system, including metrics and tracing data. Logs can fill in the details to help you uncover exactly what happened in your application’s services when an error occurred. But logs don’t come for free: you pay for the ingestion, processing, and storage of log data.
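One way to keep log data useful without drowning in free-form text is to emit structured, single-line JSON that carries identifiers (such as the request ID) that can later be joined to metrics and traces. Here is a minimal sketch for a Node.js Lambda handler, with field names that are just an assumed convention:

```typescript
import type { Context } from "aws-lambda";

// Emit one structured JSON line per event so entries can be filtered by level
// and correlated with traces, instead of grepping free-form text later.
function log(level: "info" | "error", message: string, fields: Record<string, unknown>): void {
  console.log(JSON.stringify({ level, message, timestamp: new Date().toISOString(), ...fields }));
}

export async function handler(event: { orderId?: string }, context: Context) {
  log("info", "order received", {
    requestId: context.awsRequestId, // join key back to this specific invocation
    orderId: event.orderId,
  });
  // ... business logic ...
  return { statusCode: 200 };
}
```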

Why Is Building Serverless Observability Expensive?

Collecting, storing, and analyzing large volumes of telemetry data doesn’t come for free. Logging services like Amazon CloudWatch can be a hidden cost for serverless architectures, and it also takes engineering time and money to build an effective and efficient observability tool. Let’s examine a few factors that make this task time-consuming and expensive:

  • Implementing distributed tracing alone is an enormous task, requiring you to link together all the resources that processed a request and to render the resulting data in a visually understandable way.
  • Once you’ve planned and implemented a custom observability stack, it has to run somewhere. Observability is a software service like any other, meaning its CPU, disk, and memory incur their own costs.
  • The more services you’re monitoring, the more log data they generate. As mentioned before, logs don’t come for free, and logging can become expensive if not configured correctly.

More data can provide deeper insights, but it also translates to higher costs. You might be willing to tolerate those costs with the expectation that the additional data will help you improve application performance. But if you have so much data that the signal gets lost in the noise, the cost is multiplied rather than mitigated, and finding the right balance on your own is yet another expense.

Build or Buy?

The question is, should you build such a system yourself, or should you simply buy one? “Buying” can also mean adopting pre-built open-source software: while you might not pay for the software itself, you still have to pay someone to integrate it with your architecture, tune it, and handle ongoing maintenance and patching. On the other hand, we’ve just seen how complex, expensive, and time-consuming building from scratch can be.

Lumigo falls into the “buy” category here. Lumigo’s observability solution with one-click tracing lets you debug serverless and containerized workloads in a central location, allowing you to maximize performance on our cost-efficient and visually understandable platform.

What Puts Serverless Applications at Risk of Experiencing High Observability Costs?

The primary risk factor for high serverless observability costs is writing too much data and keeping it for too long.

If you pump every little detail into stdout, CloudWatch Logs will write vast amounts of data for you to read later. You are charged by the quantity of data written and stored, so your costs will rack up quickly with this strategy. As mentioned before, you may also end up with a poor signal-to-noise ratio, adding the expense of sifting a flood of data for meaningful insights.
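A simple mitigation is to gate verbose output behind a log level, so that debug detail only reaches CloudWatch Logs when you actually need it. A sketch, assuming a LOG_LEVEL environment variable that you set on the function:

```typescript
// Gate verbose output behind LOG_LEVEL so routine traffic doesn't flood
// CloudWatch Logs with data you pay to ingest and store.
const LEVELS = ["debug", "info", "warn", "error"] as const;
type Level = (typeof LEVELS)[number];

const configured: Level = (process.env.LOG_LEVEL as Level) ?? "info";

function enabled(level: Level): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(configured);
}

export function debug(message: string, fields: Record<string, unknown> = {}): void {
  if (enabled("debug")) {
    console.log(JSON.stringify({ level: "debug", message, ...fields }));
  }
}
```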

Log retention is another major problem. The default retention setting in CloudWatch Logs is to keep logs forever, which is good in one sense, because no data ever gets lost. But usually (and ideally) you will react to problems within hours or days, so you don’t need to retain logs from two years ago. Unless you change that setting in CloudWatch Logs, you’ll keep paying to store those irrelevant logs.

How to Mitigate High Observability Costs

Approaching observability with a serverless mindset can help keep your application’s observability tools cost efficient. This means that distributed systems, like serverless applications, should use an observability approach tailored specifically to them. Traditional monitoring approaches that cater to virtual machines and monolithic on-premises deployments should be avoided.

Serverless systems are built using many different types of managed services, each with a singular purpose. The services’ different functions translate into differences in their monitoring needs. You have to keep this difference in mind when choosing an observability solution, whether you go the building or buying route. As a developer, you need insight into all of these services and their respective performance and limitations.

Let’s explore some ways you can minimize observability costs regardless of your choice of tool or platform.

Log Retention

As mentioned earlier, it’s important to ensure that you have configured your retention time in CloudWatch Logs correctly; otherwise, your costs will increase daily, since nothing is ever deleted. The default is never to delete anything because accidentally deleting data is such a no-go in IT. This shifts the decision of when to delete to the user. Only you know when your data isn’t needed anymore.
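One way to apply this consistently is to set an explicit retention policy on each log group, for example with the AWS SDK for JavaScript v3. The 30-day value and log group name below are assumptions, so pick whatever matches your own recovery window:

```typescript
import {
  CloudWatchLogsClient,
  PutRetentionPolicyCommand,
} from "@aws-sdk/client-cloudwatch-logs";

const logs = new CloudWatchLogsClient({});

// Cap how long a log group keeps data; after that, CloudWatch deletes it for you.
async function setRetention(logGroupName: string, retentionInDays: number): Promise<void> {
  await logs.send(new PutRetentionPolicyCommand({ logGroupName, retentionInDays }));
}

// Example: keep this function's logs for 30 days instead of forever.
setRetention("/aws/lambda/process-order", 30).catch(console.error);
```

Most infrastructure-as-code tools, such as CloudFormation, the CDK, and the Serverless Framework, also expose a log retention setting, which keeps the policy versioned alongside the rest of your stack.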

Automatic Instrumentation

It’s worth implementing automatic instrumentation of your Lambda functions and capturing this data with an OpenTelemetry SDK. This way, you know what is happening inside your functions, and the data will be saved in a standardized way so you can analyze it with the tools of your choice. With a tool like Lumigo, you can then visualize, analyze, and debug your data.
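On Lambda, automatic instrumentation is usually attached via a layer (AWS’s ADOT layer or a vendor layer such as Lumigo’s), but conceptually the setup looks something like the sketch below, which wires the OpenTelemetry Node SDK with its auto-instrumentations and an OTLP exporter; the collector URL is a placeholder:

```typescript
// tracing.ts - load this before your handler code runs, for example through a
// Lambda layer or NODE_OPTIONS, so instrumentation hooks in before other imports.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // placeholder; point at your collector
  }),
  // Auto-instruments common libraries (HTTP, AWS SDK, etc.) without code changes.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```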

Distributed Tracing

Distributed tracing links together all the services that worked on a specific request. This allows you to see how events flow through your architecture and identify points where things might go wrong. Open-source tools like OpenTelemetry and Jaeger can help to a degree, but they aren’t efficient because of their demanding setup and maintenance requirements.
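The hard part in serverless is carrying trace context across asynchronous hops such as SQS or SNS. Done by hand, it means injecting the context on the producer side and extracting it on the consumer side, roughly as in the sketch below; it relies on the OpenTelemetry propagation API, assumes a propagator has already been registered by your SDK setup, and uses a made-up message shape:

```typescript
import { context, propagation } from "@opentelemetry/api";

// Producer side: inject the active trace context into the outgoing message so
// the consumer can continue the same trace after the async hop.
export function buildMessage(body: unknown): { body: unknown; traceContext: Record<string, string> } {
  const traceContext: Record<string, string> = {};
  propagation.inject(context.active(), traceContext);
  return { body, traceContext };
}

// Consumer side: extract the propagated context and run the handler inside it,
// so spans created here become children of the producer's trace.
export function withExtractedContext<T>(
  message: { traceContext: Record<string, string> },
  handler: () => T
): T {
  const parent = propagation.extract(context.active(), message.traceContext);
  return context.with(parent, handler);
}
```

Every producer and consumer in the system needs this wiring, which is why hand-rolled tracing quickly becomes a setup and maintenance burden.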

Solutions like Lumigo, however, offer the benefits of distributed tracing without lengthy setup or time-consuming maintenance.

Central Dashboards

As explained above, AWS already gives you most of the data you need, but makes you bounce back and forth between multiple service control panels when trying to fix a bug. Lumigo offers a single pane of glass experience that allows you to open just one dashboard and see everything you need to optimize application performance.

Lumigo allows you to monitor your complete stack—from serverless functions to APIs and other services—in the context of requests. The Lumigo Dashboard provides an up-to-date view of the entire serverless architecture in one place, without the need to switch context or tools.

Summary

Building, running, and maintaining an observability solution can be a major hidden cost in running a serverless application.

Choosing an observability solution designed specifically for serverless contexts will help you reap all the benefits that serverless offers. Generic monitoring tools will gather data at a VM or container level and might drown actionable, service-specific information in a sea of CPU and memory metrics. You also end up paying to store all this data, even if you don’t need it.

Lumigo focuses on data that’s important for each of these services, providing convenient access to it in a central location. You only pay for what you need, and you can easily locate important data with just a few clicks.