OpenTelemetry is an open source framework for creating and managing telemetry data, including metrics, logs, and traces. It provides tools, SDKs, integrations, and APIs that enable a vendor-agnostic implementation, letting you send telemetry data to existing monitoring and tracing systems, known as “backends”.
OpenTelemetry is not a full observability platform, as it does not give you a way to process, store and query telemetry. Rather, it lets you collect and export data from applications to various commercial and open source backends.
OpenTelemetry offers a pluggable architecture that enables you to add technology protocols and formats easily. Supported open source projects include the metrics format of Prometheus, which allows you to store and query metrics data in a variety of backends, and the trace protocols used, among others, by Jaeger and Zipkin for storing and querying tracing data.
This is part of an extensive series of guides about Observability.
Related content: Read our guide to opentelemetry vs opentracing.
OpenTelemetry standardizes the process of collecting telemetry data. In the past, developers had to use different tools for different programming languages, or even different tools within the same language. This made it difficult to coordinate observability efforts across a diverse software ecosystem.
OpenTelemetry provides a standardized set of APIs and libraries that work across a wide range of languages and platforms. This means you can use the same tool to collect and analyze data from all your software components. OpenTelemetry also uses a common format for all telemetry data, which makes it easier to aggregate, correlate, and analyze data from different sources.
Another significant benefit of OpenTelemetry is its vendor-agnostic nature. This means that it’s not tied to any specific monitoring or observability vendor. Instead, it provides a neutral, open-source framework that you can use with any vendor’s solutions.
This has several advantages. First, it gives you the freedom to choose the monitoring solution that suits your needs, without being locked into a particular tool or platform. Second, being vendor-agnostic means that OpenTelemetry can integrate with a wide range of other tools and frameworks.
OpenTelemetry goes beyond basic metrics and logs to provide a rich, detailed view of your software’s performance. It can capture data including traces, metrics, and logs, as well as more advanced data types like histograms and summaries.
OpenTelemetry’s data capture is also highly customizable. You can choose exactly what data to collect, how to collect it, and how to analyze it. This lets you tailor your observability strategy to your specific needs and goals.
OpenTelemetry doesn’t just collect data; it also provides powerful tools for handling and analyzing that data. It includes a flexible and extensible data processing pipeline, which you can customize to fit your needs.
This pipeline allows you to filter, aggregate, and transform your telemetry data in a variety of ways. For example, you can use it to remove sensitive information from your data, aggregate data from multiple sources into a single view, or transform raw data into more useful and meaningful formats.
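As an illustration of the kind of transformation such a pipeline stage might apply, here is a minimal JavaScript sketch (not actual Collector configuration or any OpenTelemetry API) that scrubs a sensitive attribute from a span-like object before export; the field names are hypothetical:

```javascript
// Hypothetical span-shaped object; the attribute names are illustrative.
const span = {
  name: "GET /profile",
  attributes: { "http.method": "GET", "user.email": "jane@example.com" },
};

// A pipeline stage that drops sensitive attributes before export.
const scrubSensitive = (s) => ({
  ...s,
  attributes: Object.fromEntries(
    Object.entries(s.attributes).filter(([key]) => key !== "user.email")
  ),
});

console.log(scrubSensitive(span).attributes); // { 'http.method': 'GET' }
```

The same pattern (filter, then re-emit) underlies aggregation and format transformations as well.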
Telemetry data consists of any output collected from system sources for observability purposes. This data is analyzed together to view the dependencies and relationships within a distributed system. The three main data classes are currently referred to as the “three pillars of observability” and include logs, metrics, and traces; although the hope is that, with time, OpenTelemetry may grow to be able to collect other types of telemetry data like profiling and end-user monitoring.
A log is a textual record of a specific event that occurred at a specific point in time. The trigger to generate the log entry is part of the code of the application, so systems produce log entries repeatedly when the relative code is executed. The entry records the time of the event and provides a payload including a message to describe the nature of the event, context about that event, and additionally other metadata that can be useful later for analysis.
Depending on how logs are created, which formatting rules they follow, and how easy they are for automated logic to process, logs can be broadly categorized as unstructured, semi-structured, or structured.
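For instance, the same event can be emitted as free-form text or as a structured JSON record; a small sketch (the field names and values are illustrative):

```javascript
// Unstructured: easy for humans to read, harder for machines to parse reliably.
console.log("2024-01-01T12:00:00Z ERROR payment failed for order A-1042");

// Structured: the same event as JSON, trivially machine-processable.
const record = {
  timestamp: "2024-01-01T12:00:00Z",
  level: "ERROR",
  message: "payment failed",
  orderId: "A-1042",
};
console.log(JSON.stringify(record));
```
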
Logs offer a reliable, easy-to-grasp source of information about an application’s behavior, and developers rely on them when troubleshooting code and verifying its execution. Log data can provide the fine-grained detail needed to identify the root cause of a failure when that failure is located in a specific component of the overall application. In a distributed system, however, logs alone may not suffice to understand where faults originate and which symptoms are merely side effects.
Note: Logs are one of the newest parts of the OpenTelemetry specification and are still undergoing major change.
A metric is a series of data points, each pairing a value with a timestamp, which is why “timeseries” is largely considered a synonym for “metrics”. The values of data points are usually numeric, e.g., the count of requests served within a certain timeframe, but in some monitoring systems they can also be strings (e.g., the “INFO” metrics of Prometheus) or booleans.
To reduce the computing resources needed to store and process metrics over long timeframes, it is common practice to “aggregate” their values, for example reducing the granularity of a metric from one data point every second to the average, maximum and, in some cases, percentiles of the data points over a minute or ten.
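The idea can be sketched with plain JavaScript (illustrative values, not an OpenTelemetry API):

```javascript
// Sixty per-second data points (fake request counts)...
const perSecond = Array.from({ length: 60 }, (_, i) => 100 + (i % 5));

// ...aggregated into a single per-minute data point.
const perMinute = {
  average: perSecond.reduce((sum, v) => sum + v, 0) / perSecond.length,
  max: Math.max(...perSecond),
};

console.log(perMinute); // { average: 102, max: 104 }
```

One data point per minute now stands in for sixty, at the cost of losing the second-by-second detail.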
Since metrics tend to include less sensitive data than logs, infrastructure providers and third-party services more commonly expose metrics about what they do on a user’s behalf than they expose logs.
A trace describes the entire journey a request makes across a distributed system. As requests make their way into the system, the components processing them create spans, which document operations like “received request XYZ” or “issued database query ABC”, at which point in time the operation began, and how long it took to complete.
Spans are grouped by their trace identifier and link to their predecessor spans, effectively creating a directed acyclic graph of spans as the request is processed across the distributed system. Thanks to the fine granularity of the information collected in a trace, it is usually possible to see at a glance where errors and latency in processing a request originate, and how they spread across the distributed system.
A span typically consists of the following data: an operation name, a trace ID and span ID, the ID of the parent span, start and end timestamps, and a set of attributes, events, and a status describing the outcome of the operation.
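As a plain-object sketch (the field names follow the OpenTelemetry data model loosely; the values are made up):

```javascript
const span = {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736", // shared by all spans of the request
  spanId: "00f067aa0ba902b7",
  parentSpanId: "a2fb4a1d1a96d312",            // links to the predecessor span
  name: "issued database query ABC",
  startTime: Date.parse("2024-01-01T12:00:00.000Z"),
  endTime: Date.parse("2024-01-01T12:00:00.250Z"),
  attributes: { "db.system": "postgresql" },
  status: "OK",
};

console.log(`${span.name} took ${span.endTime - span.startTime} ms`);
```
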
The value of a trace goes beyond troubleshooting one single request. For example, by aggregating data across multiple traces, one can generate metrics in terms of rate, errors, and duration (RED) form, which are a large part of the so-called “Golden Signals” in the Site Reliability Engineering (SRE) practice as originally defined at Google.
Learn more in our detailed guide to opentelemetry tracing.
Related content: Read our guide to opentracing.
OpenTelemetry consists of several components: a cross-language specification, per-language SDKs, tools for collecting, transforming, and exporting telemetry data, and automatic instrumentation and contrib packages. You can use these components instead of vendor-specific SDKs and tools.
The OpenTelemetry Collector provides a vendor-agnostic proxy for receiving, processing, and exporting telemetry data. Collector contrib packages enable it to receive telemetry data in various formats, including OTLP, Prometheus, Zipkin, and Jaeger, and send it to several backends, sometimes in parallel (e.g., for redundancy reasons). It also lets you process and filter telemetry data before it is exported.
Learn more in our detailed guide to the OpenTelemetry Collector
OpenTelemetry provides language SDKs that enable you to use the OpenTelemetry API to generate telemetry data in a given programming language and export it to a specific backend. The OpenTelemetry SDKs are the foundation for the automatic instrumentation of popular frameworks and libraries that comes with OpenTelemetry contrib packages, and they also enable you to write bespoke instrumentation within your application, for example, to trace in-house frameworks that are not supported by the OpenTelemetry community.
OpenTelemetry supports various components that generate telemetry data from widely-adopted frameworks and libraries for supported languages. For example, outbound and inbound HTTP requests from an HTTP library generate data about those specific requests.
The way automatic instrumentation is applied to an application differs between languages due to the differences in the underpinning runtimes.
One language may require loading a component alongside the application, while another may prefer pulling a package explicitly into the codebase. Ecosystem coverage, that is, how many of the popular libraries and frameworks have automatic instrumentation, also differs across languages.
Exporters enable the OpenTelemetry implementation in an application to upload telemetry to one or several preferred backends. An exporter decouples the instrumentation from your backend configuration, making it easier to change backends without changing the instrumentation you added to your code. Moreover, since exporters translate OpenTelemetry data into another format, e.g., the Jaeger trace protocol, you can send the same data to different backends simply by adding more exporters.
Learn more in our detailed guide to OpenTelemetry Architecture
While OpenTelemetry is a critical enabler of observability for many organizations, it is not without its challenges.
As you explore OpenTelemetry, you’ll find that the types of data that can be collected and processed are limited. OpenTelemetry supports a defined set of data types—traces, metrics, and logs—and other data types are not natively supported.
If you need to gather and analyze a type of data that OpenTelemetry doesn’t support, you might have to rely on other tools or systems to collect and analyze this data. This goes against the main tenet of OpenTelemetry, which aims to provide one set of observability APIs.
The second challenge you’ll encounter with OpenTelemetry is its complexity. OpenTelemetry can have a steep learning curve and might be difficult for developers without expertise in observability. Setting up and configuring OpenTelemetry is also a non-trivial effort: you’ll need to configure numerous components, including the collector, exporters, and instrumentation.
In addition, the project is evolving rapidly, with new features and updates being added regularly. This can make it difficult for you to keep up with the changes.
OpenTelemetry becomes more difficult to configure and manage at large scale. Managing large-scale OpenTelemetry deployments requires close monitoring of their health, troubleshooting, and ongoing adjustment of configurations. This can be a challenge for smaller teams, and a burden for larger ones.
OpenTelemetry is the successor to both the OpenTracing and OpenCensus projects. The Cloud Native Computing Foundation (CNCF) supported OpenTracing, while Google backed OpenCensus. In 2019, the two projects merged into OpenTelemetry, which became an incubating CNCF project.
OpenTracing is a discontinued project that provided a vendor-agnostic API for writing distributed tracing instrumentation, along with a set of semantic conventions to keep the produced telemetry consistent. Unlike OpenTelemetry, OpenTracing was not an implementation, but rather a set of interfaces that implementations like Jaeger or Zipkin could implement to increase portability. OpenTracing was initially released in 2016 and, while it is no longer developed and never reached a stable release, it is still integrated into popular software and implemented by various tracers.
OpenCensus provides libraries to collect application metrics and distributed traces and transfer the data in real time to a backend. It was created by Google, based on the company’s internal Census library, and released in 2018 as an open source tool. Unlike OpenTracing, OpenCensus is not formally discontinued, and some support and security patching is being provided for the foreseeable future.
OpenCensus has implementations in various languages. The metrics it collects share the same propagation tags and metadata, an idea that lives on in OpenTelemetry as the concept of a “resource”. OpenCensus collects metrics and trace data from processes and forwards it to the backend regardless of format and output; applications typically import and use the exporters appropriate for each application.
Learn more in our detailed guide to OpenTelemetry vs. OpenTracing
Prometheus is an open source tool for monitoring time-series data. It was initially developed at SoundCloud in 2012 and was later accepted into the CNCF. You can use Prometheus to collect, process, and query metrics.
Here are key differences between OpenTelemetry and Prometheus:
Learn more in our detailed guide to OpenTelemetry vs Prometheus
This example auto-instruments a Node.js app and emits metrics to the console. This tutorial is based on the OpenTelemetry quick start guide.
To get started with OpenTelemetry in Node.js, install the following npm packages:
npm install @opentelemetry/sdk-node @opentelemetry/api
npm install @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/sdk-metrics-base
We’ll use the following example application. Save this code as app.js:
const express = require("express");
const PORT = process.env.PORT || "8080";
const app = express();

app.get("/", (req, res) => {
  res.send("Hello World");
});

app.listen(parseInt(PORT, 10), () => {
  console.log(`Listening for requests on http://localhost:${PORT}`);
});
Now run the application using node app.js and make sure it is listening on localhost port 8080.
We’ll use the following JavaScript code to auto-instrument the application and allow OpenTelemetry to emit metrics. Create a file with a name like trace.js in your project folder.
The trace.js file first imports dependencies:
const opentelemetry = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { diag, DiagConsoleLogger, DiagLogLevel } = require('@opentelemetry/api');
Sets the diagnostic log level to enable troubleshooting:
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.INFO);
And initializes the OpenTelemetry SDK, defining a ConsoleSpanExporter to output trace data to the console:
const sdk = new opentelemetry.NodeSDK({
  traceExporter: new opentelemetry.tracing.ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();
Next, we’ll create an OpenTelemetry Meter to monitor metrics from the Node.js application. We’ll create a file named monitoring.js and add the following code.
The monitoring.js file sets up a MetricExporter and imports the sdk-metrics-base library:
'use strict';
const { MeterProvider, ConsoleMetricExporter } = require('@opentelemetry/sdk-metrics-base');
Creates a MeterProvider which lets you create metrics of your choice:
const meter = new MeterProvider({
  exporter: new ConsoleMetricExporter(),
  interval: 1000,
}).getMeter('your-meter-name');
And adds a simple counter metric that counts incoming requests to the application:
const requestCount = meter.createCounter("requests", {
  description: "Count all incoming requests"
});

const boundInstruments = new Map();

module.exports.countAllRequests = () => {
  return (req, res, next) => {
    if (!boundInstruments.has(req.path)) {
      const labels = { route: req.path };
      const boundCounter = requestCount.bind(labels);
      boundInstruments.set(req.path, boundCounter);
    }
    boundInstruments.get(req.path).add(1);
    next();
  };
};
Here is how to import and use this code in the Node.js application. We’ll add this code at the top of app.js:
const express = require("express");
const { countAllRequests } = require("./monitoring");
const app = express();
app.use(countAllRequests());
Now, every time a user makes a request to the application, the meter will count the request.
Run the sample application using the command node -r ./trace.js app.js (the -r flag preloads trace.js so the auto-instrumentation is active before the application code runs).
Point your browser to the address http://localhost:8080, and you’ll see the metric displayed in the console by the ConsoleMetricExporter, as follows:
{
  "name": "requests",
  "description": "Count all incoming requests",
  "unit": "1",
  "metricKind": 0,
  "valueType": 1
}
{ "route": "/" }
"value": "1"
In OpenTelemetry, attributes are key-value pairs that provide context for distributed tracing, metrics, logs or resources. Resources are a representation of the component that emits telemetry, like a process in a container. Attributes enable teams to capture additional data to find meaningful correlations, e.g., in the face of performance changes. Whether for root cause analysis or forward-looking performance optimization, attributes can help filter, search, visualize, and aggregate telemetry data.
Here are a few types of attributes you can use to improve observability:
If you are using attributes in an organization with multiple teams and codebases, it is very important to adopt attribute names consistently. Without this standardization, troubleshooting issues across team and codebase boundaries becomes far more complex and confusing.
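For example, a shared helper that every team uses to build its baseline attributes is one simple way to enforce consistency. The attribute names below follow semantic-convention style but are illustrative, not a prescribed OpenTelemetry API:

```javascript
// A shared helper so all teams attach the same baseline attributes.
function commonAttributes(service, extra = {}) {
  return {
    "service.name": service.name,
    "service.version": service.version,
    "deployment.environment": service.env,
    ...extra, // team-specific attributes layered on top
  };
}

const attrs = commonAttributes(
  { name: "checkout", version: "1.4.2", env: "production" },
  { "http.route": "/checkout" }
);
console.log(attrs["service.name"], attrs["http.route"]); // checkout /checkout
```
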
Cardinality is a measure of the number of dimensions in which telemetry data is likely to be recorded and queried. Attribute values and their indexing are the largest source of increased cardinality and, depending on the backend storing the data, it may require much more storage or slow down queries significantly.
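A small sketch of why attribute choice drives cardinality: a bounded attribute like the route yields a handful of series, while an unbounded one like a user ID yields one series per distinct value (the data below is made up):

```javascript
const requests = [
  { route: "/checkout", userId: "u-1" },
  { route: "/checkout", userId: "u-2" },
  { route: "/cart", userId: "u-3" },
];

// Number of distinct timeseries created if we label by a given attribute.
const seriesCount = (key) => new Set(requests.map((r) => r[key])).size;

console.log(seriesCount("route"));  // 2 — bounded, stays small
console.log(seriesCount("userId")); // 3 — grows with every new user
```
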
One of the biggest benefits of OpenTelemetry is that it enables vendor-agnostic instrumentation. All telemetry calls made by your application are made through the vendor-independent OpenTelemetry API.
To keep this vendor independence, it is important to keep the provider configuration at the top level of your application or service (usually at the entry point). This decouples the OpenTelemetry instrumentation from instrumentation calls, allowing you to choose the tracing framework that best suits your use case without changing your instrumentation code. By decoupling provider configuration from instrumentation, you can easily switch providers using flags or environment variables.
In a continuous integration (CI) environment where you run integration tests, you may not want to run a tracing provider at all, to reduce cost and complexity. For example, in local development it might be enough to trace metrics using an in-memory export, while in production it is necessary to use a hosted SaaS service for tracing. Separating provider initialization from instrumentation makes it easy to switch providers based on your environment.
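This separation can be sketched as a small factory at the entry point; the exporter objects here are hypothetical stand-ins for real SDK exporters:

```javascript
// Hypothetical stand-ins for real exporters (in-memory, console, OTLP).
function chooseExporter(env) {
  switch (env) {
    case "test":       return { kind: "in-memory" }; // CI: no provider cost
    case "production": return { kind: "otlp" };      // hosted SaaS backend
    default:           return { kind: "console" };   // local development
  }
}

// Instrumentation code never changes; only this entry-point choice does.
console.log(chooseExporter(process.env.NODE_ENV || "development").kind);
```
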
In most cases, unit tests focus on program logic and ignore telemetry. This may lead to your telemetry being unusable when you most need it.
OpenTelemetry SDKs provide in-memory exporters, which let you query telemetry data collected during unit tests. The use of these constructs is not documented in most languages, so the best place to find examples of their use is in the OpenTelemetry unit tests for each project.
Related content: Read our guide to opentelemetry golang
Related content: Read our guide to opentelemetry spring boot
Related content: Read our guide to opentelemetry operator
Related content: Read our guide to opentelemetry instrumentation
OpenTelemetry offers a pluggable architecture that enables you to add technology protocols and formats easily. Using OpenTelemetry, Lumigo provides containerized applications with end-to-end observability through automated distributed tracing.
Together with our content partners, we have authored in-depth guides on several other topics that can also be useful as you explore the world of observability.
Authored by Lumigo