What is Serverless Observability?

Ideally, observability should help you understand the state of your application and how it performs under different circumstances. However, while serverless observability may seem similar to serverless monitoring and testing, the three achieve different goals.

Testing helps you check your application for known issues, and monitoring helps you evaluate system health according to known metrics. Observability helps you search and discover unknown issues, providing end-to-end visibility.

Observability is typically achieved through the instrumentation of the application. The goal is to collect as much information as is needed to discover and remediate previously unknown issues. This information is critical for maintaining the system.

However, it can be challenging to achieve observability in serverless applications. This is mainly due to the characteristics of event-driven functionality. Each function is disparate, operates in isolation, and is highly ephemeral. Introducing observability into this environment requires customization.


Monitoring vs Observability

The difference between monitoring and observability can be summarized as follows:

Monitoring is simply collecting relevant metrics about your application
Observability is putting a system in place that can let you proactively use those metrics to ensure the health of the application

The main reason to monitor a system is to maintain its health. After collecting system metrics such as CPU utilization and network traffic, along with error logs, you can feed them into incident alerting tools. By tracking errors, outages, and security incidents, you can alert the relevant staff through alarms and notifications.
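As a rough illustration of how collected metrics can become inputs to alerting, the sketch below checks a metrics snapshot against simple thresholds. The `AlertRule` class, the metric names, and the threshold values are all assumptions made for this example; a real setup would hand the resulting messages to whatever alerting tool you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # key to look up in the metrics snapshot (hypothetical names)
    threshold: float  # alert when the value is strictly above this
    message: str


def evaluate(rules, snapshot):
    """Return a message for every rule whose metric exceeds its threshold."""
    alerts = []
    for rule in rules:
        value = snapshot.get(rule.metric)
        if value is not None and value > rule.threshold:
            alerts.append(f"{rule.message} ({rule.metric}={value})")
    return alerts


# Illustrative rules only; pick thresholds that match your own system.
RULES = [
    AlertRule("cpu_utilization_pct", 80.0, "High CPU utilization"),
    AlertRule("error_count", 0, "Errors logged"),
]

alerts = evaluate(RULES, {"cpu_utilization_pct": 91.5, "error_count": 0})
```

Only the CPU rule fires for this snapshot, because the error count does not exceed its threshold.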

Observability of system outputs results in better insights that enable remediation of issues. Some critical outputs you should monitor include:

Error percentages throughout all function and container invocations
Cold start incidence
Memory consumption
Outliers, or function calls that took longer than expected
Average duration of function executions

Through these and other metrics, you will be able to identify and assess:

Chain reactions between components and other system parts
Microservice bottlenecks, identifiable by monitoring trace latency between a function call and dependent components
Error and bottleneck patterns for preventative purposes
Traffic flow troubleshooting to understand where bottlenecks lie in the integration between microservices
Application performance measurement and assessment
Long-term system performance, measured as number of invocations per function per period and success rate
Resource costs, which in a serverless deployment are directly tied to the number and duration of function executions
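To make the outputs listed earlier concrete, here is a minimal sketch that derives them from a batch of invocation records. The `Invocation` record shape, the `summarize` function, and the rule that an outlier is any call slower than three times the average are assumptions for illustration, not a standard definition.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Invocation:
    function_name: str
    duration_ms: float
    memory_mb: float
    cold_start: bool
    error: bool


def summarize(invocations, outlier_factor=3.0):
    """Compute error rate, cold start rate, memory, duration, and outliers."""
    if not invocations:
        raise ValueError("no invocations to summarize")
    count = len(invocations)
    avg_duration = mean(i.duration_ms for i in invocations)
    return {
        "invocations": count,
        "error_pct": 100.0 * sum(i.error for i in invocations) / count,
        "cold_start_pct": 100.0 * sum(i.cold_start for i in invocations) / count,
        "max_memory_mb": max(i.memory_mb for i in invocations),
        "avg_duration_ms": avg_duration,
        # Assumed rule: an outlier is slower than outlier_factor times the average.
        "outliers": [i for i in invocations
                     if i.duration_ms > outlier_factor * avg_duration],
    }


batch = [
    Invocation("resize", 100.0, 128.0, True, False),
    Invocation("resize", 100.0, 130.0, False, False),
    Invocation("resize", 100.0, 126.0, False, True),
    Invocation("resize", 1300.0, 200.0, True, False),
]
summary = summarize(batch)
```

Because cost follows the number and duration of executions, the same records could also feed a cost estimate once you plug in your provider's pricing.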

Challenges of Serverless Observability

Serverless observability is a challenge that requires a dedicated toolset. AWS provides several tools that can facilitate observability, including X-Ray and CloudWatch. X-Ray covers the tracing pillar by producing a timeline that describes serverless transactions, while CloudWatch provides the metrics and logs. Aggregating the data remains the responsibility of the developer.

Many monitoring solutions use agents to collect data. Unfortunately, serverless functions do not provide a location within the container for agent installation, and AWS destroys containers periodically.

You could add code to your functions to gather data for monitoring, but even if you succeed in retrieving the information from a single function, it won’t provide data about the entire chain of events in the serverless architecture. Serverless functions are invoked based on events, typically received through an API gateway, and their results are then often saved to storage such as DynamoDB.
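The sketch below shows why data from one function is not enough: records from separate functions only become a chain of events when something links them. Here that link is a correlation ID passed along inside the event. The handler names, the `correlation_id` field, and the in-memory `records` list are hypothetical stand-ins for illustration, not an AWS feature.

```python
import uuid


def with_correlation_id(event):
    """Return a copy of the event that carries a correlation ID.

    Reuses the ID set by an upstream function, or creates one at the entry point.
    """
    enriched = dict(event)
    enriched.setdefault("correlation_id", str(uuid.uuid4()))
    return enriched


def api_handler(event, records):
    """Entry point, standing in for a function behind an API gateway."""
    event = with_correlation_id(event)
    records.append({"function": "api_handler",
                    "correlation_id": event["correlation_id"]})
    # Hypothetical downstream message, such as an item written to a table or queue.
    return {"order_id": event["order_id"],
            "correlation_id": event["correlation_id"]}


def storage_handler(event, records):
    """Downstream function, triggered by the message produced above."""
    event = with_correlation_id(event)
    records.append({"function": "storage_handler",
                    "correlation_id": event["correlation_id"]})
    return event


records = []
message = api_handler({"order_id": 42}, records)
storage_handler(message, records)
```

Both records now share one ID, so they can be joined later. Doing this by hand for every function and every event source is the maintenance burden that automated tracing is meant to remove.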

To debug issues, you must be able to visualize the full lifecycle of a serverless transaction. Only an automated tool can deliver distributed tracing across multiple resources, both AWS-native and external databases or APIs.

Dedicated serverless monitoring solutions like Lumigo operate without an agent, gathering monitoring information and sending it to persistent storage during invocation. They can piece together monitoring data from across the serverless pipeline, providing end-to-end data needed to visualize and debug a serverless transaction.


Best Practices for Serverless Observability

Metrics in Observability

According to Google’s site reliability engineering (SRE) book, there are four golden signals to use when monitoring distributed systems—latency, traffic, errors, and saturation.

To track these signals, you need an efficient process that collects all metrics from your environment and delivers them for aggregation and analysis. You should also add customized dashboards and alerts to the process.

Traditional monitoring involves using metrics like CPU, memory consumption spikes, latency across services, and traffic trends. However, these metrics are less relevant to serverless environments. You need to observe all operations related to each function and monitor all called application programming interfaces (APIs).
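One way to observe every operation of a function is to wrap its handler so that each invocation leaves a record. The following is a minimal sketch under assumptions: the `observed` decorator, the record fields, and the in-memory `METRICS` list are invented for illustration, and a real function would ship these records to a metrics backend instead of keeping them in memory.

```python
import functools
import time

# Module-level state survives between invocations while a container stays warm,
# so the first invocation that sees "cold" as True is treated as a cold start.
_state = {"cold": True}


def observed(metrics):
    """Decorator that appends one metrics record per invocation to `metrics`."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context=None):
            record = {
                "function": handler.__name__,
                "cold_start": _state["cold"],
                "error": False,
            }
            _state["cold"] = False
            start = time.perf_counter()
            try:
                return handler(event, context)
            except Exception:
                record["error"] = True
                raise  # never swallow the original failure
            finally:
                record["duration_ms"] = (time.perf_counter() - start) * 1000.0
                metrics.append(record)
        return wrapper
    return decorator


METRICS = []


@observed(METRICS)
def handler(event, context=None):
    if event.get("fail"):
        raise ValueError("simulated failure")
    return {"status": "ok"}


handler({"order_id": 1})
```

Each record carries latency, an error flag, and a cold start flag, and the number of records per period gives a traffic figure, which covers three of the four golden signals for a single function.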

Logging in Observability

Metrics can notify you when issues occur, but they do not provide the information needed to troubleshoot them. This insight typically comes from logs, which record information about anything occurring in the application.

There are some common pitfalls to avoid when logging. Avoid manual logging, which is time-consuming and does not provide enough information. Additionally, avoid logging out of context: without an organizational system in place, you might spend a lot of time looking for the right logs.

When logging, prefer automated processes that help you log as much information as possible. Additionally, add metadata to your log lines and index your logs so that you can quickly locate the right log, and add an analytics tool that helps you sift through the information. You should also record custom metrics that are relevant to your unique business needs.
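Adding metadata to log lines can be sketched with Python's standard `logging` module by emitting each line as a JSON object, which makes every field indexable. The `JsonFormatter` class, the `context` key, and the field names such as `request_id` are assumptions chosen for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs can be indexed by field."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Metadata passed as extra={"context": {...}} is merged into the entry.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, sort_keys=True)


def build_logger(stream, name="orders"):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for stdout, which a log service would collect
log = build_logger(stream)
log.info("order stored",
         extra={"context": {"request_id": "req-123", "function": "store_order"}})
```

With lines in this shape, an analytics tool can filter on `request_id` or `function` directly instead of searching free text.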

Related content: read our guides to AWS Lambda Logs and Serverless Logging

Distributed Tracing in Observability

Tracing can help you understand the entire lifecycle of actions or requests across several systems. A trace records the entire journey of an action or request as it moves across all nodes in the distributed system. This information can help you better understand the overall health of the system, discover bottlenecks, and quickly resolve issues.

In serverless environments, tracing must span every function and service a request touches, to ensure the entire lifecycle is truly captured. To do this efficiently, you need to instrument your code. This process may alter some calls. For example, HTTP and HTTPS calls may be routed through middleware that records the trace information.

However, note that instrumentation and tracing are ongoing processes that require continuous maintenance. To keep them efficient, you should add tags to traces. This can help you locate information more quickly and improve the analysis process.
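The two ideas in this section, routing calls through a recording layer and tagging traces, can be sketched together. The `Tracer` class, the `traced_call` helper, and the tag names are hypothetical and keep spans in memory; a real tracer would export them to a tracing backend.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collects spans in memory for one trace."""

    def __init__(self):
        self.trace_id = str(uuid.uuid4())
        self.spans = []

    @contextmanager
    def span(self, name, **tags):
        record = {"trace_id": self.trace_id, "name": name, "tags": dict(tags)}
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["tags"]["error"] = type(exc).__name__
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000.0
            self.spans.append(record)

    def find(self, **tags):
        """Return spans whose tags include every given key/value pair."""
        return [s for s in self.spans
                if all(s["tags"].get(k) == v for k, v in tags.items())]


def traced_call(tracer, name, func, *args, **tags):
    """Middleware: route a call through the tracer instead of calling it directly."""
    with tracer.span(name, **tags):
        return func(*args)


tracer = Tracer()
# Built-in functions stand in for real outgoing calls such as HTTP requests.
total = traced_call(tracer, "price-lookup", sum, [2, 3],
                    service="pricing", customer="acme")
traced_call(tracer, "inventory-check", len, "abc",
            service="inventory", customer="acme")
pricing_spans = tracer.find(service="pricing")
```

Tags turn the span list into something searchable: `tracer.find(customer="acme")` returns both spans, while `tracer.find(service="pricing")` narrows the result to one.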

Related content: read our guide to Distributed Tracing
