Observability is a measure of how well the internal state of a system can be inferred from its external outputs. It helps us understand what is happening in our application and troubleshoot problems when they arise. It’s an essential part of running production workloads and providing a reliable service that attracts and retains satisfied customers. It is even more important when it comes to serverless because we rely on managed services that we can’t simply “run on your machine”.
In my work as a consultant, many clients ask me about how to build observability into their serverless applications. If it’s better to build a custom solution using the native AWS services or to pay for a 3rd-party service such as Lumigo.
AWS offers a range of services that helps you build observability into your serverless application, such as CloudWatch, CloudWatch Logs and X-Ray. These services cater for your basic needs cheaply, but they lack the premium developer experience that specialized vendors such as Lumigo can offer.
For example, when compared with X-Ray, Lumigo offers a significantly better developer experience, including:
While the native AWS services have plenty of limitations and don’t offer the best developer experience out-of-the-box, you can work around these and still create a solution that gives you a lot of observability into your serverless application. For example:
All of these are doable but they take a significant amount of engineering time and often require coordination and agreement between different parts of the organization. Teams have to follow the same conventions such as log message format and the approach towards correlation IDs (their format, and how to propagate them) for the solution to work.
If the build vs buy decision is purely based on capabilities and what will make your developers more productive, then I think the most sensible approach is to use a 3rd party service that ticks most of your boxes and supplement them with the AWS services. This is the approach I have used in all of my recent projects and it has allowed me to focus on solving the client’s business challenges and deliver the desired outcomes on time and on budget.
In my projects, I use Lumigo as the main entry point when I try to understand what’s going on in my application and to troubleshoot issues. I seldom write custom log messages anymore because Lumigo captures most of the information I need to be able to infer the internal state of my applications.
Take the Lumigo dashboard, for instance. It gives me a lot of insights into the activities in my system. I can see at a glance if there is a high number of Lambda invocation errors and if so, which functions are failing. I can see hot spots and identify functions that are invoked frequently, these are good candidates for optimization (e.g. right-sizing the memory setting to reduce cost).
I can see Lambda functions where cold starts are an issue. Lambda functions that have a high frequency of cold starts are also good candidates for Provisioned Concurrency.
And I can also see which of the services (3rd party APIs, AWS services, or internal APIs) my application depends on are performing poorly.
The dashboard alone saves me hours of engineering time to collect, analyze and visualize this (very useful) information. The information it gives me helps me quickly find and investigate problems that arise in production. Problems that used to take me hours to find can now be identified in minutes!
In fact, when a problem does arise, I would usually receive an alert through Slack or PagerDuty and I will find the problem waiting for me in the Issues page in Lumigo.
From here, I can navigate to the erroneous transactions and interrogate them. Because every outbound request from the Lambda function is recorded in Lumigo, I can see how long they took and what was said in those communications. Together with the invocation event and the environment variables that were used in the invocation, I can quickly infer the internal state of the function when it was invoked.
As you can see, that gives me a lot of observability and it takes only a few minutes to set up, and I didn’t have to make any changes to my code. Instead, all of my energy and time can be better spent on tackling the problems that actually matter to my clients and deliver value to their businesses. Which is very much aligned with the spirit of serverless – to move away from undifferentiated heavy-lifting and use the most precious resources we have (engineering time) to solve the most valuable problems.
Building custom solutions to solve the problem of serverless observability is fun and it can be challenging in the best possible way! But they are also not what differentiates your business from your competitors.
Observability is an essential part of running a production application, and the good news is that with platforms like Lumigo, observability for serverless applications is easier than you think.
To learn more about how to overcome common challenges with building production-grade serverless applications, join us on our next webinar on Friday, 19th November.
I will be speaking alongside Ryan Jones, the founder of ServerlessGuru, and share many of the lessons that we have learnt running serverless applications in production over the last few years.
Hope to see you there!
If you wanna try out Lumigo, it also has a generous free tier (no expiration) that lets you trace 150k Lambda invocations per month, for free. Sign up now.