Instrumenting your code is essential to understanding your system’s performance and diagnosing issues as they arise. Traditionally, this was accomplished using proprietary vendor libraries, causing major lock-in. Enter OpenTelemetry.
OpenTelemetry is an open source project that provides a set of APIs, SDKs, and integrations for instrumenting code. This article will introduce OpenTelemetry, explain the concept of OpenTelemetry resources and their importance, and discuss the current state of OpenTelemetry resource attributes for AWS services.
OpenTelemetry (also known as OTel) is an open source project under the CNCF (Cloud Native Computing Foundation) tasked with standardizing the collection of telemetry, i.e., application performance data. It aims to provide a comprehensive set of APIs, SDKs, and collectors to collect data from any application on any platform. OpenTelemetry delivers instrumentation libraries for applications written in many programming languages, and the OpenTelemetry community has created a set of exporters that can send data to a wide variety of monitoring systems.
At its heart, OpenTelemetry is committed to making telemetry data available to anyone, anywhere, at any time. It enables researchers to build tools that make sense of the data, and then allows developers to build applications based on that processed data.
A fundamental aspect of telemetry is knowing which system it is coming from. Think of the error rate metric for the “POST /checkout” API. A high value, say, 50% failures in the past ten minutes, is surely a reason for concern, right? Well, yes, if that telemetry is coming from production. But if it is coming from, say, a development environment, there is time to analyze and fix the issue.
We all instinctively know that telemetry and its evaluation depend on many things: the system that generates it and the environment in which it resides, the time of the year (50% checkout failures on Black Friday is way worse than on January 1st), the importance of the system in the overall architecture and use cases, and many other factors.
In OpenTelemetry, the way to represent which system generates a piece of telemetry is through resource attributes.
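To make the checkout example concrete, here is a minimal sketch of how the same error rate can lead to different reactions depending on the resource attributes attached to it. The `ErrorRateSample` class, the `alert_severity` function, the 50% threshold and the severity names are all hypothetical, and the `deployment.environment` key with its values is an assumption made for illustration rather than something this article prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorRateSample:
    """One error-rate measurement plus the resource attributes it carries."""
    route: str
    error_rate: float  # fraction of failed requests, from 0.0 to 1.0
    resource: dict = field(default_factory=dict)


def alert_severity(sample, threshold=0.5):
    """Return "ok", "ticket" or "page" for a sample (hypothetical policy)."""
    if sample.error_rate < threshold:
        return "ok"
    # Assumed attribute key; only production failures wake somebody up.
    environment = sample.resource.get("deployment.environment", "unknown")
    return "page" if environment == "production" else "ticket"


prod = ErrorRateSample("POST /checkout", 0.5,
                       {"deployment.environment": "production"})
dev = ErrorRateSample("POST /checkout", 0.5,
                      {"deployment.environment": "development"})

print(alert_severity(prod))  # page
print(alert_severity(dev))   # ticket
```

The measurement is identical in both cases; only the resource attributes differ, and they alone decide how urgent the alert is.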
AWS resources, such as Amazon EC2 instances, Amazon ECS and Amazon EKS workloads, and Amazon S3 buckets, are entities that you can collect telemetry data from. And while you collect that telemetry, you can annotate it with resource attributes that describe where you got the telemetry from.
Resource attributes are key-value pairs that describe the characteristics of your AWS resources. For example, you can use resource attributes like “cloud.region” and “cloud.account.id” to specify the region or account identifier, respectively, associated with an EC2 instance or a Lambda function. Semantic conventions define which resource attribute keys are suggested to be used for which type of information.
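As a rough illustration of the key-value nature of resource attributes, the sketch below models a resource as a plain Python dictionary and attaches it to telemetry records. It deliberately does not use the OpenTelemetry SDK: the `annotate` and `from_region` helpers, the record layout, the metric name, and the region and account values are all made up for this example. The `cloud.provider` and `cloud.platform` keys come from the OpenTelemetry semantic conventions, alongside the `cloud.region` and `cloud.account.id` keys mentioned above.

```python
# Resource attributes for a hypothetical EC2 instance.
ec2_resource = {
    "cloud.provider": "aws",
    "cloud.platform": "aws_ec2",
    "cloud.region": "eu-west-1",         # made-up region
    "cloud.account.id": "123456789012",  # placeholder account identifier
}


def annotate(record, resource):
    """Return a copy of a telemetry record with the resource attached."""
    return {**record, "resource": dict(resource)}


def from_region(records, region):
    """Keep only the records whose resource reports the given cloud.region."""
    return [r for r in records if r["resource"].get("cloud.region") == region]


records = [
    annotate({"metric": "http.error_rate", "value": 0.5}, ec2_resource),
    annotate({"metric": "http.error_rate", "value": 0.01},
             {**ec2_resource, "cloud.region": "us-east-1"}),
]

print(from_region(records, "eu-west-1"))
```

Because every record carries the same well-known keys, a backend can group or filter telemetry by region or account without knowing anything else about the application that produced it.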
Semantic conventions are “community standards” that govern how data is structured and labeled so that it can be properly interpreted by the tools that consume it. OpenTelemetry instrumentations and resource detectors (the pieces of logic in your application that collect resource attributes, see the “Resource detectors for AWS resources” section) should adhere to the semantic conventions as much as possible.
As of the time of writing, the OpenTelemetry semantic conventions for resources that are relevant to describe resources running on AWS services are:
Note that there are other AWS-related semantic conventions concerning the “trace” and “metrics” parts of OpenTelemetry, e.g., how to annotate spans describing the sending or receiving of Amazon SQS messages or queries against Amazon DynamoDB. However, those semantic conventions are out of the scope of this article.
The semantic conventions of OpenTelemetry are implemented in “resource detectors”. Resource detectors are pieces of logic that run once, when the OpenTelemetry SDK inside your application is initialized, and scan the environment in which the application runs. Some detectors look up specific process environment variables that denote that the application runs on a particular cloud service; for example, the presence of the “ECS_CONTAINER_METADATA_URI” environment variable indicates that the application is running on Amazon ECS. Others look for particular files; for example, the container identifier is usually looked up inside Linux-based containers by scanning the “/proc/self/cgroup” virtual file.
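The detection logic described above can be sketched in a few lines of Python. This is not the code of any actual OpenTelemetry resource detector: the function names are hypothetical, and matching a 64-character hexadecimal string with a regular expression is a simplifying assumption about what container identifiers look like in “/proc/self/cgroup”. The attribute keys and the `aws_ecs` platform value follow the OpenTelemetry semantic conventions.

```python
import os
import re

ECS_METADATA_ENV = "ECS_CONTAINER_METADATA_URI"
CGROUP_PATH = "/proc/self/cgroup"
# Assumption: a container id shows up as 64 lowercase hex characters.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text):
    """Return the first 64-character hex id found in cgroup content, or None."""
    for line in cgroup_text.splitlines():
        match = _CONTAINER_ID.search(line)
        if match:
            return match.group(0)
    return None


def detect_ecs_resource(environ, cgroup_text=""):
    """Return ECS resource attributes, or an empty dict when not on ECS."""
    if not environ.get(ECS_METADATA_ENV):
        return {}
    attributes = {"cloud.provider": "aws", "cloud.platform": "aws_ecs"}
    container_id = container_id_from_cgroup(cgroup_text)
    if container_id:
        attributes["container.id"] = container_id
    return attributes


def read_cgroup(path=CGROUP_PATH):
    """Read the cgroup file, returning "" where it does not exist."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    # Run once at startup, like a resource detector would.
    print(detect_ecs_resource(os.environ, read_cgroup()))
```

Passing the environment and the file content in as arguments keeps the detection logic easy to exercise outside of ECS, for example with a dictionary that contains the environment variable and a hand-written cgroup line.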
In the previous section we covered the resource-attribute semantic conventions related to one or more AWS services. But how do those semantic conventions map to compute and container-related services on AWS? The following table provides an overview of the OpenTelemetry SDKs that have resource detectors targeting specific AWS services, and what semantic conventions those resource detectors implement:
The table above does not mention OpenTelemetry SDKs like Ruby, Rust, C++ or Elixir that, at the time of writing, have no support for AWS-related resource attributes. It is interesting to notice that services like AWS App Runner and Red Hat OpenShift on AWS (ROSA) are mentioned in semantic conventions, and specifically in the enumeration of valid “cloud.platform” values, but at the time of writing, there is no resource detector in the OpenTelemetry SDKs that sets those values.
Resource attributes are a key feature of OpenTelemetry that enables you to contextualize the telemetry you collect with metadata describing the system from which it is collected. This additional context is fundamental to understanding the importance of the insights that the telemetry provides, and can make all the difference between an alert waking you up in the middle of the night and one that can wait until after your third coffee tomorrow.
As far as AWS is concerned, there are a number of semantic conventions that cover various aspects of some AWS services like Amazon ECS, Amazon EKS and Amazon EC2, but the support for collecting the resource attributes is spotty.
Lumigo provides support for OpenTelemetry through distributions for Node.js and Python, and supports the use of OpenTelemetry Collectors as part of its adoption of industry-standard best practices. Dive in with Lumigo’s docs today!