Mar 03 2020
Next in our series on the Amazon Builders’ Library, Lumigo Director of Engineering – and newly-minted Serverless Hero – Efi Merdler-Kravitz picks out the key insights from the article, Instrumenting distributed systems for operational visibility, by AWS Principal Engineer (AWS Lambda), David Yanacek.
About the Amazon Builders’ Library
The Amazon Builders’ Library consists of a collection of articles written by principal engineers at Amazon that explain how Amazon builds scalable and resilient systems.
Disclaimer: some nuances might be lost in this shortened form. If you want to learn about the topic in more detail then check out the original article.
Instrumenting distributed systems for operational visibility
Having just finished reading David Yanacek’s excellent deep dive, “Instrumenting distributed systems for operational visibility”, here is my mini summary, along with some of my own observations.
– Using instrumentation for understanding how a system works is a great idea, and today’s tools are even able to create a real map of the various resources that interact with each other.
– You can’t use cat, grep, sed, and awk on your serverless application; you definitely need another set of tools.
– Instrumentation frees you from the hassle of logging every statement you add. It will automatically record the most important data points, and that will allow you to debug crashes or improve performance.
– At the heart of instrumentation lies the trace ID, a unique identifier that is passed between the various services. Although not mentioned in the article, asynchronous and many-to-many execution (e.g. Batch write->Kinesis->Batch read) adds an extra layer of complexity to the tracing.
– Old-fashioned logs are still important, as not everything can be instrumented. One such example would be inner algorithm flow. However, it’s very important to correlate the logs with that special trace ID. When pulling the instrumentation details, also make sure to pull the relevant logs.
– Logs are expensive and rather complex to handle. Aside from huge enterprises, or financial institutions subject to strict security controls, most businesses are better served allowing others to handle the logs for them.
– Creating alarms is important, but the trickiest part is defining their threshold. Start with a number that makes sense, then tune it as time goes on. Don’t worry, you’ll rarely choose the right threshold at the first attempt.
– Log units of work, for example an HTTP request or a single cron run. Aggregate the multiple stages of a single unit of work into one concrete log entry, but do log “progress” when the unit of work is long-running.
– Record the input before performing any kind of manipulation on it.
– Trim big requests, pull important details, and drop the rest. For example, requests arriving from API Gateway to Lambda proxy might contain uninteresting details, so log only the body. As a rule of thumb, at Lumigo we trim to 1K, although it’s configurable.
– Have an easy way to change log level. In a Lambda, using an environment variable to set the log level is a very simple way of achieving this.
– Log the latency of all requests. It will help you determine performance issues. At Lumigo we also log the request body and response.
– Log queue depth whenever you interact with a queue: logging how many items are in the queue when pulling from or pushing to it will help you pinpoint latency issues or scalability problems.
– Group error metrics by type. Don’t use a single metric to capture all errors. Grouping errors will allow you to handle the most prevalent ones first.
– Protect your logs. This is yet another reason to use external services that support advanced security features like encryption, MFA, access control, auditing, and so on.
– Avoid writing sensitive data in the logs, and in general, choose external log services with good security and privacy credentials, such as GDPR compliance and SOC 2 attestation.
– Prune your logs. Over time you’ll find log statements that you thought made sense, but are now just littering your log stream. Remove them! Make it a habit to go over your logs on a weekly or monthly basis, and remove unneeded ones.
– If you’re using CloudWatch, don’t forget to set retention periods, as you don’t want to keep logs there indefinitely. In cases where you require long-term log storage, use more cost-effective services like S3 Glacier.
– Use CloudWatch Logs Insights or Athena (or another SaaS offering) to search your logs. At Lumigo we avoid solutions that are not serverless, so managing our own ELK is out of the question.
– Make sure that you have the same logs and metrics infrastructure in your Dev, QA and integration environments. You’ll want to test that the metrics you emit are correct and that the logs make sense.
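To make a few of the points above concrete, here is a minimal Python sketch of one log entry per unit of work. It is my own illustration, not code from the article or from Lumigo: the `LOG_LEVEL` variable name, the `trim` and `run_unit_of_work` helpers, the 1024-character limit and the field names are all assumptions. It reads the log level from an environment variable, records the trimmed input before touching it, tags the entry with a trace ID, measures latency, and records the error type so errors can later be grouped.

```python
import json
import logging
import os
import time
import uuid

# LOG_LEVEL is an assumed variable name; set it in the function configuration.
_level = getattr(logging, os.environ.get("LOG_LEVEL", "INFO").upper(), None)
if not isinstance(_level, int):
    _level = logging.INFO  # fall back when the value is not a known level

logging.basicConfig()  # no-op when the runtime already installed a handler
logger = logging.getLogger("unit_of_work")
logger.setLevel(_level)

MAX_LOGGED_CHARS = 1024  # assumed limit, in the spirit of "trim to 1K"


def trim(value, limit=MAX_LOGGED_CHARS):
    """Return value as text, cut to at most `limit` characters plus a marker."""
    text = value if isinstance(value, str) else json.dumps(value, default=str)
    return text if len(text) <= limit else text[:limit] + "...[trimmed]"


def run_unit_of_work(event, work, trace_id=None):
    """Run work(event) and emit exactly one structured log entry for it."""
    record = {
        # Reuse the caller's trace ID when there is one, so logs correlate.
        "trace_id": trace_id or str(uuid.uuid4()),
        # Capture the input before any manipulation, keeping only the body.
        "input": trim(event.get("body", "")),
    }
    start = time.perf_counter()
    try:
        result = work(event)
        record["outcome"] = "ok"
        return result
    except Exception as exc:
        record["outcome"] = "error"
        # The exception class name lets errors be grouped by type later.
        record["error_type"] = type(exc).__name__
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
        logger.info(json.dumps(record))
```

Calling `run_unit_of_work({"body": "hello"}, lambda e: e["body"].upper(), trace_id="t-1")` returns the handler's result and writes a single JSON line containing the trace ID, the trimmed input, the outcome and the latency. Keeping everything in one entry per unit of work makes it easier to query later with whichever log search tool you use.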