Observability is a measure of how well we are able to infer the internal state of our application from its external outputs. It’s an important measure because it indirectly tells us how well we’d be able to troubleshoot problems that will inevitably arise in production.
It’s been one of the hottest buzzwords in the cloud space for the last 5 years and the marketplace is swamped with observability vendors. Different tools employ different methodologies for collecting data. Some require manual instrumentation while others offer a more automatic process.
The quality and quantity of the data they collect (e.g. logs, distributed traces) also vary greatly, which directly impacts how much we can infer about the internal state of our application and what it’s doing.
In this post, let’s compare two of the most popular observability solutions for serverless applications in AWS — AWS X-Ray and Lumigo.
X-Ray can collect traces from your application and reveal useful insights:
In some instances, X-Ray is able to trace these transactions through multiple Lambda functions and/or containers.
Unlike many third-party services that aim to provide a complete observability package and give you a single pane of glass to access all the relevant data, X-Ray is limited in scope and cannot provide all the insights you need to troubleshoot issues in your application.
X-Ray is able to integrate with numerous AWS services in a number of ways. The most common way is for you to actively instrument your application code to generate the traces and for other supporting AWS services (such as SNS and EventBridge) to participate in the request tracing passively (by propagating the trace headers).
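To make the "trace headers" above a bit more concrete, here is a minimal sketch of what such a header carries. It assumes the documented X-Ray tracing header format (`Root=...;Parent=...;Sampled=...`). The `parseTraceHeader` helper is hypothetical and not part of the X-Ray SDK, and the sample value is only an illustration.

```javascript
// Parse an X-Ray tracing header value into its fields.
// NOTE: parseTraceHeader is a hypothetical helper for illustration only;
// it is not part of the X-Ray SDK.
function parseTraceHeader(value) {
  const fields = {}
  for (const part of value.split(';')) {
    const idx = part.indexOf('=')
    if (idx === -1) continue // skip fragments without a key=value shape
    fields[part.slice(0, idx).trim()] = part.slice(idx + 1).trim()
  }
  return {
    traceId: fields.Root,            // identifies the whole transaction
    parentId: fields.Parent,         // the segment that made this call
    sampled: fields.Sampled === '1', // whether this request is being traced
  }
}

// Illustrative header value in the documented format.
const header = 'Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1'
const parsed = parseTraceHeader(header)
console.log(parsed)
```

A service that participates passively only needs to forward this value unchanged, which is how downstream segments end up attached to the same trace.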
In order to capture requests to other AWS services, you need to instrument the AWS SDK using the X-Ray SDK.
const XRay = require('aws-xray-sdk-core')
const AWS = XRay.captureAWS(require('aws-sdk'))
If you only need one of the AWS clients, e.g. DynamoDB, then you can also instrument individual AWS clients.
const AWSXRay = require('aws-xray-sdk-core')
const AWS = require('aws-sdk')
const DynamoDB = AWSXRay.captureAWSClient(new AWS.DynamoDB())
But what if you’re calling other HTTP endpoints such as Twilio or Stripe, or your own APIs?
To make sure these are included in the X-Ray traces, you also need to instrument Node.js’s built-in https module (and the http module, if you call plain HTTP endpoints).
const AWSXRay = require('aws-xray-sdk-core')
AWSXRay.captureHTTPsGlobal(require('https'))
Without instrumenting your code, the X-Ray traces would be bare-bones and not add much value at all.
It’s also worth noting that X-Ray’s default sampling rule records the first request each second and then 5% of any additional requests.
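To get a feel for what that default means at different traffic levels, here is a rough back-of-the-envelope sketch. It assumes the default rule behaves as "1 request per second, plus 5% of the rest" and ignores how sampling is coordinated across concurrent function instances. The `expectedSampledPerSecond` function is a hypothetical helper, not an X-Ray API.

```javascript
// Estimate how many requests per second get traced under a
// "reservoir + fixed rate" sampling rule.
// NOTE: hypothetical helper for illustration; the defaults mirror the
// default rule described above (1 req/s, then 5%).
function expectedSampledPerSecond(requestsPerSecond, reservoir = 1, rate = 0.05) {
  // Requests that fit in the per-second reservoir are always traced.
  const fromReservoir = Math.min(requestsPerSecond, reservoir)
  // The remainder is traced at the fixed rate.
  const fromRate = Math.max(requestsPerSecond - reservoir, 0) * rate
  return fromReservoir + fromRate
}

for (const rps of [1, 10, 100, 1000]) {
  const sampled = expectedSampledPerSecond(rps)
  const percent = ((sampled / rps) * 100).toFixed(1)
  console.log(`${rps} req/s -> ~${sampled.toFixed(2)} traces/s (${percent}%)`)
}
```

Under these assumptions, a function handling 100 requests per second would have roughly 6 of them traced each second, so the one request you are looking for is usually not among them.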
X-Ray is very cost-effective compared to other solutions in the market. But it has many limitations, and in practice, I don’t find a lot of value in X-Ray. This is especially true for complex applications where the failure modes are also more complex.
I would summarize X-Ray as “a cost-effective solution that meets the basic needs of distributed tracing”. However, distributed tracing is not the end goal. The goal is to gain observability in your application so you can deal with any problems that arise.
In a way, you get what you pay for with X-Ray.
Lumigo is an observability tool that helps you troubleshoot problems in serverless and container applications. It has similarities to X-Ray in that it gives you distributed tracing but it does so much more than that.
It’s a much more complete solution and caters for all your needs when it comes to observability and troubleshooting live issues:
To get started, head over to lumigo.io and sign up for a free account. This will guide you through the process of creating an IAM role in your account, which gives Lumigo the read access it needs.
Without doing anything else, you can get some high-level insight into your Lambda functions across all the regions:
You can also enable tracing on individual functions in the Lumigo console:
Doing so lets you trace through even complex transactions spanning across multiple Lambda functions and different event sources. And you can see all the relevant Lambda logs side-by-side so everything is in one place.
You can click on each of the icons in the transaction to learn more. For example, every traced Lambda invocation records the Lambda invocation event, return value, environment variables and logs.
Every HTTP request (to other AWS services, or 3rd party APIs, or your own microservices) from these Lambda functions is automatically captured along with the request and response body. These give you regular checkpoints and let you infer the state of your application before and after that HTTP request was made.
And when problems inevitably occur, Lumigo automatically captures them and groups the errors by Lambda function and error type. You can see all the errors in the relevant time window at a glance and see their frequency and trend over time.
If you have configured a notification channel, then you will be notified of errors and timeouts straight away.
As far as serverless goes, Lumigo is the best observability tool in the market and a big part of my observability strategy as I explained in this post. It stands head and shoulders above X-Ray in terms of the value it provides.
The fact that it gives you so much observability into what’s going on in your application makes it just as valuable during development as it is in production. I can run an end-to-end test, find an error, and then quickly jump into Lumigo and look at the transaction to figure out what went wrong.
Since I started using Lumigo, I have been able to delete most of the debug log statements that I used to litter throughout my codebase. The fact that the Lambda invocation event is captured along with every HTTP request I make means I no longer have to do it myself.
Although X-Ray is very cost-effective and has some advantages as a first-party service, I have seldom found it useful outside of the context of AppSync and Step Functions, where it is able to trace through direct integrations with other AWS services. However, even in those cases, I had to rely on random encounters to stumble upon interesting traces that represent clear outliers, such as traces that have a very long duration.
Finding traces for specific transactions or GraphQL operations is nigh on impossible in X-Ray. And in most cases, it doesn’t capture enough information about the transaction. The duration of the captured HTTP requests can help identify performance problems. But the fact that request and response bodies are not captured means you can’t infer the state of the application at the time when the HTTP request was made.
In comparison, Lumigo provides a clear value proposition and is in my opinion the best observability tool for serverless applications. It has been a game changer for my developer experience working on serverless applications and one of the most important tools in my toolbox. You can sign up for Lumigo for free here.