Microservice architecture is a pattern that helps teams innovate faster: it makes scaling easier, allows each service to use a different language, and improves fault isolation. Systems built this way also have downsides. Many moving parts, concurrent invocations, and differing retry policies can make such systems challenging to operate and troubleshoot. Without proper tools, correlating logs with metrics may be difficult. To overcome these challenges, you need observability.
There are three types of telemetry data through which systems are made observable: logs, metrics, and traces.
We will take a closer look at one of them, traces, in a distributed system.
A trace exposes the entire journey of an action or request as it moves across all services in the distributed system. A trace is a collection of linked spans; each span is an operation representing a unit of work in the request, for example, a database read.
An individual span can include:
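To make the idea of "linked spans" concrete, here is a minimal sketch in plain Python. The `Span` class and the `children_by_parent` helper are hypothetical names made up for illustration; they are not the OpenTelemetry SDK types, and real spans carry more data (timestamps, attributes, status).

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Hypothetical, simplified span record (not the OpenTelemetry SDK class)."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None  # None marks the root span


def children_by_parent(spans: List[Span]) -> Dict[Optional[str], List[str]]:
    """Group span names by the ID of their parent span."""
    tree: Dict[Optional[str], List[str]] = {}
    for span in spans:
        tree.setdefault(span.parent_id, []).append(span.name)
    return tree


trace_id = secrets.token_hex(16)  # one ID shared by every span in the trace
root = Span("GET /orders", trace_id)
auth = Span("check token", trace_id, parent_id=root.span_id)
db = Span("database read", trace_id, parent_id=root.span_id)

tree = children_by_parent([root, auth, db])
print(tree[None])          # the root span
print(tree[root.span_id])  # the operations it triggered
```

The shared `trace_id` says which spans belong together, and `parent_id` is the link that lets a backend rebuild the call tree.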
In a monolithic application, a context within one service might be enough. In a distributed system, we need a method to send context from service to service to follow the execution flow. This process is called context propagation.
When the request begins, a trace is created. Every call needs to include the trace ID so that the trace can be reconstructed later. The context is included in each call and passed between services, typically via HTTP headers.
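As an illustration of passing context in an HTTP header, the sketch below writes and reads a `traceparent` header in the W3C Trace Context layout (`version-traceid-parentid-flags`). The `inject` and `extract` functions are hypothetical helpers, and the validation is simplified (for example, it does not reject all-zero IDs); in practice a client library's propagator does this for you.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def inject(headers: Dict[str, str], trace_id: str, span_id: str,
           sampled: bool = True) -> None:
    """Caller side: add the current trace context to outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Callee side: return (trace_id, parent_span_id), or None if absent/invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    return match.group(2), match.group(3)


trace_id = secrets.token_hex(16)
caller_span_id = secrets.token_hex(8)

outgoing_headers: Dict[str, str] = {}
inject(outgoing_headers, trace_id, caller_span_id)

# The receiving service continues the same trace under a new span.
received = extract(outgoing_headers)
print(outgoing_headers["traceparent"])
print(received == (trace_id, caller_span_id))
```

If `extract` returns `None`, the receiving service would typically start a new trace instead of continuing one.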
There are several specifications for context propagation that can be used, for example:
You are probably thinking: “This is great! I want it in my system!”
But what tool or standard should you choose? First, we need to distinguish between collecting data and analyzing it.
Two projects set out to help: OpenTracing and OpenCensus.
In 2019, both projects merged into OpenTelemetry, which offers a set of APIs and libraries that standardize how you collect and transfer telemetry data. It gives you a single way to transmit data to your analysis system (also called a backend).
OpenTelemetry is intentionally not an analysis tool. You can choose where to send the data to be analyzed. The project's creators assume that innovation should happen in the analysis tools, not in the collection methods.
OpenTelemetry has several benefits:
In March 2021, version 1.0 was released and declared production-ready.
We discussed that you need additional data in your services for context propagation.
Instrumentation is the process of adding the code that creates traces and spans to your application.
You can do it manually, using the API and SDK provided by a client library and making the required code changes. In some cases, you can also use auto-instrumentation agents that generate spans automatically, so that no code change is required.
Auto-instrumentation adds instrumentation code to the libraries and frameworks that you use. How it is done depends on the language. For example, in Java you attach an agent JAR to the JVM, while in Go you wrap existing libraries.
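The difference between the two approaches can be sketched with a toy tracer written in plain Python. Everything here (`start_span`, `traced`, `FINISHED`, and the two business functions) is a hypothetical stand-in, not the OpenTelemetry API; the point is only where the span code lives.

```python
import functools
import time
from contextlib import contextmanager
from typing import Callable, List, Tuple

FINISHED: List[Tuple[str, float]] = []  # (span name, duration in seconds)


@contextmanager
def start_span(name: str):
    """Toy stand-in for a tracer call: time the block and record it."""
    started = time.perf_counter()
    try:
        yield
    finally:
        FINISHED.append((name, time.perf_counter() - started))


def traced(func: Callable) -> Callable:
    """Wrap a function so each call produces a span without editing its body."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with start_span(func.__qualname__):
            return func(*args, **kwargs)
    return wrapper


def load_order(order_id: int) -> dict:
    return {"id": order_id}


def handle_request(order_id: int) -> dict:
    # Manual instrumentation: the span is written into the business code.
    with start_span("handle_request"):
        return load_order(order_id)


# "Automatic" style: patch the function from the outside,
# much as an agent would patch a library it recognizes.
load_order = traced(load_order)

result = handle_request(42)
print([name for name, _ in FINISHED])
```

The inner span finishes first, so it is recorded before the outer one. Manual spans let you name and place operations exactly where they matter to you, while the wrapper covers code you did not write or do not want to touch.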
In most cases, I would recommend combining the two, getting benefits from each:
OpenTelemetry is composed of the following components:
Using a collector is optional; however, it allows you to send the data to different backends without changing the instrumentation or redeploying applications. The collector is the default location where the instrumentation libraries export their telemetry data. It is composed of receivers, processors, and exporters, which support different source and destination formats. For example, you can send trace data to Jaeger and metric data to Prometheus.
AWS provides an upstream-first, AWS-supported distribution of OpenTelemetry. It extends the project with AWS-specific elements, enabling easier auto-instrumentation and providing ways to send metrics and traces to multiple AWS and partner monitoring solutions.
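The receiver, processor, and exporter stages can be pictured as a small pipeline. The sketch below is a toy model in plain Python, not the Collector's actual API or configuration format; the function names and the record shape are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Exporter = Callable[[Record], None]


def receive(raw: Iterable[Record]) -> List[Record]:
    """Receiver: accept telemetry from the applications (copied defensively)."""
    return [dict(item) for item in raw]


def add_environment(records: List[Record], env: str) -> List[Record]:
    """Processor: enrich every record before it is exported."""
    return [{**record, "deployment.environment": env} for record in records]


def export(records: List[Record], exporters: Dict[str, Exporter]) -> None:
    """Exporter stage: route each record to the backend for its signal type."""
    for record in records:
        exporters[record["signal"]](record)


# Two lists stand in for a tracing backend and a metrics backend.
trace_backend: List[Record] = []
metric_backend: List[Record] = []

incoming = [
    {"signal": "trace", "name": "GET /orders"},
    {"signal": "metric", "name": "http.server.duration"},
]
export(
    add_environment(receive(incoming), "staging"),
    {"trace": trace_backend.append, "metric": metric_backend.append},
)
print(trace_backend)
print(metric_backend)
```

Swapping a backend means changing only the exporter mapping; the code that produced `incoming` stays untouched, which is the benefit the collector gives you.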
Examples of AWS enhancements:
AWS also provides an easy way to integrate AWS Lambda with OpenTelemetry. The collector can run as part of the Lambda extension along with the OpenTelemetry language SDK and send data to different backends.
It is all packaged into a Lambda layer, which provides a plug-and-play user experience by automatically instrumenting a Lambda function and including the language-specific SDK with an out-of-the-box configuration for AWS X-Ray. The layer is available for multiple languages and can also be configured to send trace data to other destinations that support the OpenTelemetry Protocol (OTLP).
You can add OpenTelemetry to your Lambda function by following these few easy steps.
We will use a Python 3.8 function and AWS X-Ray in this example, but you can do the same with other supported languages and backend tools.
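For reference, the function itself can stay free of tracing code. The handler below is a hypothetical example, not taken from AWS documentation. It assumes the AWS-provided OpenTelemetry layer for Python is attached, that active tracing is enabled on the function, and that the `AWS_LAMBDA_EXEC_WRAPPER` environment variable is set to `/opt/otel-instrument`, which is how I understand the layer to be switched on; check the current layer documentation for the exact values.

```python
import json


def lambda_handler(event, context):
    """Plain handler with no tracing code.

    Assumption: the OpenTelemetry Lambda layer wraps this function at startup
    and creates the spans, so nothing OpenTelemetry-specific appears here.
    """
    name = (event or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local smoke test; in Lambda, the runtime calls lambda_handler for you.
if __name__ == "__main__":
    print(lambda_handler({"name": "OpenTelemetry"}, None))
```

Because the instrumentation lives in the layer and its configuration, you can enable or remove it without redeploying the function code.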
Try it yourself in a free AWS sandbox environment.