OpenTelemetry: Concepts, Architecture, and a Quick Tutorial

What Is OpenTelemetry?

OpenTelemetry is an open source framework for creating and managing telemetry data, including metrics, logs, and traces. It provides tools, SDKs, integrations, and APIs for a vendor-agnostic implementation, so you can send telemetry data to existing monitoring and tracing systems, known as “backends”.

OpenTelemetry is not a full observability platform, as it does not give you a way to process, store and query telemetry. Rather, it lets you collect and export data from applications to various commercial and open source backends.

OpenTelemetry offers a pluggable architecture that makes it easy to add support for further protocols and formats. Supported open source formats include the Prometheus metrics format, which lets you store and query metrics data in a variety of backends, and the trace protocols used by Jaeger and Zipkin, among others, for storing and querying tracing data.

In this article:

  • What is Telemetry Data?
    • Logs
    • Metrics
    • Traces
  • OpenTelemetry Architecture and Components
    • Collector
    • Language SDKs
    • Automatic Instrumentation
    • Exporters
  • OpenTelemetry vs. OpenCensus vs. OpenTracing
  • OpenTelemetry vs. Prometheus
  • Tutorial: OpenTelemetry Node.js Quick Start
    • Prerequisites
    • Step 1: Create the Example Application
    • Step 2: Instantiate Tracing
    • Step 3: Monitor a Metric
    • Step 4: Run the Application
  • OpenTelemetry Best Practices
    • Use Attributes
    • Use Attributes Consistently
    • Carefully Consider Cardinality
    • Keep Initialization Separate from Instrumentation
    • Unit Test Tracing Using Memory Span Exporters
  • Microservices Monitoring with Lumigo

What is Telemetry Data?

Telemetry data consists of any output collected from system sources for observability purposes. This data is analyzed together to view the dependencies and relationships within a distributed system. The three main data classes are currently referred to as the “three pillars of observability” and include logs, metrics, and traces; although the hope is that, with time, OpenTelemetry may grow to be able to collect other types of telemetry data like profiling and end-user monitoring.

Logs

A log is a textual record of a specific event that occurred at a specific point in time. The trigger that generates the log entry is part of the application code, so systems produce log entries repeatedly whenever the relevant code is executed. The entry records the time of the event and provides a payload that includes a message describing the nature of the event, context about the event, and other metadata that can be useful later for analysis.

Depending on how logs are created, which formatting rules are used, and how easily automated logic can process them, logs can be broadly categorized as follows:

  • Unstructured logs: text written for humans to read, which may not include metadata that is easy for machines to process. This is generally considered the most common approach to logging, and unfortunately it is usually hard to parse for analysis.
  • Structured logs: data organized in a standard format that is easy for other code to parse (e.g., JSON). They include additional metadata that makes querying logs (especially filtering and grouping) easier.
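
The difference between the two categories can be sketched in a few lines of plain Node.js. The field names (level, userId, durationMs) and the helper functions are illustrative assumptions, not part of any OpenTelemetry API:

```javascript
// Unstructured: readable for humans, but a parser must guess where fields start and end.
function unstructuredLog(message, userId, durationMs) {
  return `${new Date().toISOString()} ${message} for user ${userId} in ${durationMs}ms`;
}

// Structured: one JSON object per line, with explicit field names.
function structuredLog(level, message, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields,
  });
}

const lines = [
  structuredLog("info", "checkout completed", { userId: "u-1", durationMs: 120 }),
  structuredLog("error", "checkout failed", { userId: "u-2", durationMs: 950 }),
  structuredLog("info", "checkout completed", { userId: "u-3", durationMs: 80 }),
];

// With structured logs, filtering and grouping become plain field lookups.
const errors = lines
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.level === "error");

console.log(unstructuredLog("checkout completed", "u-1", 120));
console.log(errors.map((entry) => entry.userId));
```

Answering the same question ("which users hit an error?") from the unstructured line would require a regular expression tied to the exact wording of the message.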

Logs offer a reliable, easy-to-grasp source of information about an application’s behavior. Developers rely on logs when troubleshooting code and verifying its execution. This data may provide the fine-grained information needed to identify the root cause of system failures and other issues when the failure is located in a specific component of the overall application, but it may not always suffice to understand where faults originate in a distributed system and which symptoms are merely side effects.

Note: Logs are one of the newest parts of the OpenTelemetry specification and are still undergoing major change.

Metrics

A metric is a series of data points, each a value associated with a timestamp, which has led to the word “timeseries” being largely considered a synonym for “metrics”. The values of data points are often numeric, e.g., the count of requests served within a certain timeframe, but in some monitoring systems they can also be strings (e.g., the “INFO” metrics of Prometheus) or booleans.

In order to reduce the amount of computing resources needed to store and process metrics over long timeframes, it is common practice to “aggregate” their values, for example reducing the granularity of a metric from one data point every second to the average and, in some cases, percentiles of the data points over one or ten minutes.
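
As a rough illustration of this kind of aggregation, the sketch below collapses per-second data points into per-minute summaries. The point shape ({ timestamp, value }), the function names, and the choice of a nearest-rank percentile are assumptions made for the example; real backends use their own aggregation strategies.

```javascript
// Nearest-rank percentile of an ascending-sorted array of numbers.
function percentile(sortedValues, p) {
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Groups { timestamp, value } points into fixed windows and keeps only
// summary statistics for each window.
function downsample(points, windowMs) {
  const windows = new Map();
  for (const { timestamp, value } of points) {
    const windowStart = Math.floor(timestamp / windowMs) * windowMs;
    if (!windows.has(windowStart)) windows.set(windowStart, []);
    windows.get(windowStart).push(value);
  }
  return [...windows.entries()]
    .sort(([a], [b]) => a - b)
    .map(([windowStart, values]) => {
      const sorted = [...values].sort((a, b) => a - b);
      const sum = sorted.reduce((total, v) => total + v, 0);
      return {
        windowStart,
        count: sorted.length,
        average: sum / sorted.length,
        p95: percentile(sorted, 95),
      };
    });
}

// 120 one-per-second points (values 1..120) collapse into two one-minute summaries.
const perSecond = Array.from({ length: 120 }, (_, i) => ({
  timestamp: i * 1000,
  value: i + 1,
}));
const perMinute = downsample(perSecond, 60000);
console.log(perMinute);
```

The trade-off is visible in the output: 120 stored values become 2 records, but the individual per-second values can no longer be recovered.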

Since metrics tend to include less sensitive data than logs, infrastructure providers and third-party services more commonly expose metrics than logs about what they do on a user’s behalf.

Traces

A trace describes the entire journey a request makes across a distributed system. As requests make their way into the system, the components processing them create spans, which document operations like “received request XYZ” or “issued database query ABC”, at which point in time the operation began, and how long it took to complete.

Spans are grouped by their trace identifier and link to their predecessor spans, effectively creating a directed acyclic graph of spans as the processing of a request is carried out in the distributed system. Thanks to the fine granularity of information collected in a trace, it is usually possible to see at a glance where errors and latency in processing one request originate, and how they spread across the distributed system.

A span typically consists of the following data:

  • Trace identifier
  • Span identifier
  • The operation’s name
  • A start and end timestamp
  • Metadata in key-value format, encoding information about the infrastructure (e.g., which container processed this request), etc.
  • Events (e.g., logs, exceptions and errors)
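
To make the relationship between spans concrete, here is a minimal sketch that rebuilds the tree of one trace from a flat list of spans. The field names (traceId, spanId, parentSpanId, startMs, endMs) and the sample data are assumptions for illustration; they do not reproduce the exact OpenTelemetry data model.

```javascript
// Flat list of spans, as a backend might receive them (illustrative data).
const spans = [
  { traceId: "t1", spanId: "a", parentSpanId: null, name: "GET /checkout", startMs: 0, endMs: 120 },
  { traceId: "t1", spanId: "b", parentSpanId: "a", name: "SELECT cart", startMs: 10, endMs: 40 },
  { traceId: "t1", spanId: "c", parentSpanId: "a", name: "POST /payments", startMs: 45, endMs: 110 },
  { traceId: "t1", spanId: "d", parentSpanId: "c", name: "INSERT payment", startMs: 60, endMs: 100 },
];

// Links every span to its parent and returns the root spans of the trace.
function buildTraceTree(flatSpans) {
  const byId = new Map(flatSpans.map((span) => [span.spanId, { ...span, children: [] }]));
  const roots = [];
  for (const node of byId.values()) {
    const parent = node.parentSpanId ? byId.get(node.parentSpanId) : undefined;
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}

// Renders the tree as indented lines with each span's duration.
function renderTree(nodes, depth = 0) {
  return nodes.flatMap((node) => [
    `${"  ".repeat(depth)}${node.name} (${node.endMs - node.startMs}ms)`,
    ...renderTree(node.children, depth + 1),
  ]);
}

const tree = buildTraceTree(spans);
console.log(renderTree(tree).join("\n"));
```

The indented output mirrors what tracing UIs typically display as a waterfall: the payment call accounts for most of the time spent in the checkout request.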

The value of a trace goes beyond troubleshooting a single request. For example, by aggregating data across multiple traces, one can generate rate, errors, and duration (RED) metrics, which are a large part of the so-called “Golden Signals” in the Site Reliability Engineering (SRE) practice as originally defined at Google.
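
A minimal sketch of deriving RED metrics from spans follows. The span shape (name, durationMs, error) and the redMetrics helper are hypothetical and only illustrate the aggregation idea:

```javascript
// Hypothetical finished spans; `error` marks a failed request.
const finishedSpans = [
  { name: "GET /cart", durationMs: 40, error: false },
  { name: "GET /cart", durationMs: 60, error: false },
  { name: "GET /cart", durationMs: 500, error: true },
  { name: "POST /checkout", durationMs: 200, error: false },
];

// Aggregates the spans observed during `windowSeconds` into
// rate, error ratio, and average duration per operation name.
function redMetrics(spans, windowSeconds) {
  const byOperation = new Map();
  for (const span of spans) {
    const stats = byOperation.get(span.name) || { count: 0, errors: 0, totalMs: 0 };
    stats.count += 1;
    stats.errors += span.error ? 1 : 0;
    stats.totalMs += span.durationMs;
    byOperation.set(span.name, stats);
  }
  const result = {};
  for (const [name, stats] of byOperation) {
    result[name] = {
      ratePerSecond: stats.count / windowSeconds,
      errorRatio: stats.errors / stats.count,
      averageDurationMs: stats.totalMs / stats.count,
    };
  }
  return result;
}

const red = redMetrics(finishedSpans, 60);
console.log(red);
```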

OpenTelemetry Architecture and Components

OpenTelemetry consists of several components, including a cross-language specification, per-language SDKs, tools for collecting, transforming, and exporting telemetry data, and automatic instrumentation and contrib packages. You can use the provided components instead of vendor-specific SDKs and tools.

Collector

This component provides a vendor-agnostic proxy for receiving, processing, and exporting telemetry data. It offers collector contrib packages that enable you to receive telemetry data in various formats, including OTLP, Prometheus, Zipkin and Jaeger, and send it to several backends, sometimes in parallel (e.g., for redundancy reasons). It also enables you to process and filter telemetry data before it is exported.
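
The receive, process, and export flow can be pictured with the following conceptual sketch. It is not how the Collector is implemented or configured (the real Collector is a separate service with its own configuration file); the function names, the health-check filter, and the attribute value are assumptions chosen for the example.

```javascript
// A pipeline is an ordered list of processors plus a list of exporters.
function createPipeline({ processors, exporters }) {
  return function receive(batch) {
    const processed = processors.reduce((items, processor) => processor(items), batch);
    // Fan the same processed batch out to every configured backend.
    for (const exporter of exporters) exporter(processed);
    return processed;
  };
}

// Processor: drop health-check spans before they are exported.
const dropHealthChecks = (spans) => spans.filter((span) => span.name !== "GET /healthz");

// Processor: add an attribute to everything that passes through.
const addEnvironment = (spans) =>
  spans.map((span) => ({
    ...span,
    attributes: { ...span.attributes, "deployment.environment": "staging" },
  }));

// Two in-memory arrays stand in for two different backends.
const backendA = [];
const backendB = [];

const receive = createPipeline({
  processors: [dropHealthChecks, addEnvironment],
  exporters: [(spans) => backendA.push(...spans), (spans) => backendB.push(...spans)],
});

receive([
  { name: "GET /healthz", attributes: {} },
  { name: "GET /cart", attributes: {} },
]);
console.log(backendA, backendB);
```

Because filtering happens before the fan-out, both backends receive the same reduced and enriched batch.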

Learn more in our detailed guide to the OpenTelemetry Collector

Language SDKs

OpenTelemetry provides language SDKs that enable you to use the OpenTelemetry API to generate telemetry data in a given programming language and export this data to a specific backend. The SDKs are the foundation for the automated instrumentation of popular frameworks and libraries that comes with the OpenTelemetry contrib packages, and they also let you write bespoke instrumentation within your application, for example, to trace in-house frameworks that are not supported by the OpenTelemetry community.

Automatic Instrumentation

OpenTelemetry supports various components that generate telemetry data from widely-adopted frameworks and libraries for supported languages. For example, outbound and inbound HTTP requests from an HTTP library generate data about those specific requests.

The way automatic instrumentation is applied to an application differs between languages due to the differences in the underpinning runtimes.

One language may require using a component loaded alongside the application, while another may prefer pulling a package explicitly into the codebase. Coverage of an ecosystem, that is, how many of the popular libraries and frameworks have automatic instrumentation, also differs across languages.
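
In a dynamic runtime such as Node.js, one common way to achieve this is to wrap the functions of a library at load time. The sketch below shows the idea on a stand-in object; the instrument helper, the httpClient object, and the span shape are hypothetical, and the wrapper handles synchronous calls only.

```javascript
const recordedSpans = [];

// Replaces target[methodName] with a wrapper that records a span-like object
// around every call, without the calling code having to change.
function instrument(target, methodName, spanName) {
  const original = target[methodName];
  target[methodName] = function instrumented(...args) {
    const span = { name: spanName, startMs: Date.now(), endMs: undefined, error: undefined };
    try {
      return original.apply(this, args);
    } catch (err) {
      span.error = err.message;
      throw err;
    } finally {
      span.endMs = Date.now();
      recordedSpans.push(span);
    }
  };
}

// A stand-in for a third-party library the application already uses.
const httpClient = {
  get(url) {
    return { status: 200, url };
  },
};

instrument(httpClient, "get", "HTTP GET");

// Application code is unchanged, yet every call now produces telemetry.
const response = httpClient.get("http://localhost:8080/");
console.log(response.status, recordedSpans.length);
```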

Exporters

Exporters enable the OpenTelemetry implementation in an application to upload telemetry to one or several preferred backends. An exporter decouples the instrumentation from your backend configuration, making it easier to change backends without changing the instrumentation you added to the code. Moreover, since exporters translate OpenTelemetry data into another format, e.g., the Jaeger trace protocol, you can send the same data to different backends simply by adding more exporters.

Learn more in our detailed guide to OpenTelemetry Architecture

OpenTelemetry vs. OpenCensus vs. OpenTracing

OpenTelemetry is the successor to both the OpenTracing and OpenCensus projects. The Cloud Native Computing Foundation (CNCF) supported OpenTracing, and Google supported OpenCensus. In 2019, the two projects decided to merge into OpenTelemetry, which later became an incubating CNCF project.

What is OpenTracing?

OpenTracing is a discontinued project that provided a vendor-agnostic API for writing distributed tracing instrumentation and a set of semantic conventions to keep the telemetry produced consistent. Unlike OpenTelemetry, OpenTracing was not an implementation, but rather a set of interfaces that other implementations, like Jaeger or Zipkin, could implement to increase portability. OpenTracing was initially released in 2016 and, while it is no longer developed and has never reached a stable release, it is still integrated in popular software and implemented by various tracer implementations.

What is OpenCensus?

OpenCensus provides libraries to collect application metrics and distributed traces and transfer the data in real time to a backend. It originated from Google’s internal Census library and was released as an open source tool in 2018. Unlike OpenTracing, OpenCensus is not formally discontinued, and some support and security patching is being provided for the foreseeable future.

OpenCensus has implementations in various languages. The metrics and traces it collects share the same propagation tags and metadata, an idea that lives on in OpenTelemetry as the concept of “resource”. OpenCensus collects metrics and trace data independently of the format the backend expects; applications typically import and configure the exporters they need.

Learn more in our detailed guide to OpenTelemetry vs. OpenTracing (coming soon)

OpenTelemetry vs. Prometheus

Prometheus is an open source tool for monitoring time-series data. It was initially developed by SoundCloud in 2012 and later got accepted into the CNCF. You can use Prometheus to collect, process and query metrics.

Here are key differences between OpenTelemetry and Prometheus:

  • Prometheus provides client libraries and framework integrations that make it simple to expose a “metrics” endpoint. A metrics endpoint is an HTTP endpoint provided by an application that Prometheus and other compatible software can “scrape”, i.e., send a request to in order to obtain the current value of metrics. Prometheus comes with its own formats to serialize metrics data, the most common of which are the textual and protobuf forms on which OpenMetrics is based.
  • Prometheus defines various “types” of metrics, like Counters, the value of which does not decrease over the lifetime of the application, and Histograms, which provide rich support for percentiles and averages.
  • OpenTelemetry defines its own types of metrics, called “instruments”, which are then mapped onto those of Prometheus if the respective exporter is used.
  • However, there are OpenTelemetry exporters for other metrics formats too, such as those of OpenCensus, InfluxDB, and Elastic, as well as various other proprietary and open-source formats.
  • In terms of processing and storage, there are several projects that allow scraping or receiving (e.g., via the remote-write API) metrics in the Prometheus format. Examples are Prometheus itself, Thanos, Cortex (now superseded by Mimir), Timescale and more.
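
To give a feel for what a scrape returns, the sketch below renders a counter in the Prometheus textual format. It is a simplified illustration: the renderCounter helper and the metric name http_requests_total are assumptions, and a real application would normally rely on a Prometheus client library or an OpenTelemetry exporter rather than hand-written formatting.

```javascript
// Escapes label values for the text format: backslash, double quote, and newline.
function escapeLabelValue(value) {
  return String(value).replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders one counter with several label sets as text exposition lines.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, labelValue]) => `${key}="${escapeLabelValue(labelValue)}"`)
      .join(",");
    lines.push(labelText ? `${name}{${labelText}} ${value}` : `${name} ${value}`);
  }
  return lines.join("\n") + "\n";
}

const body = renderCounter("http_requests_total", "Count all incoming requests", [
  { labels: { route: "/" }, value: 3 },
  { labels: { route: "/cart" }, value: 1 },
]);
console.log(body);
```

Each label combination becomes its own line, which is also why the number of distinct label values matters for storage and query cost.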

Learn more in our detailed guide to OpenTelemetry vs Prometheus (coming soon)

Tutorial: OpenTelemetry Node.js Quick Start

This example auto-instruments a Node.js app and emits metrics to the console. This tutorial is based on the OpenTelemetry quick start guide.

Prerequisites

To get started with OpenTelemetry in Node.js, install the following npm packages (express is required by the example application):

npm install express
npm install @opentelemetry/sdk-node @opentelemetry/api
npm install @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/sdk-metrics-base

Step 1: Create the Example Application

We’ll use the following example application. Save this code as app.js:

const express = require("express");

const PORT = process.env.PORT || "8080";
const app = express();

app.get("/", (req, res) => {
  res.send("Hello World");
});

app.listen(parseInt(PORT, 10), () => {
  console.log(`Listening for requests on http://localhost:${PORT}`);
});

Now run the application using node app.js and make sure it is listening on localhost port 8080.

Step 2: Instantiate Tracing

We’ll use the following JavaScript code to auto-instrument the application and allow OpenTelemetry to emit traces. Create a file with a name like trace.js in your project folder.

The trace.js file first imports dependencies:

const opentelemetry = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { diag, DiagConsoleLogger, DiagLogLevel } = require('@opentelemetry/api');

Sets the diagnostic log level to INFO (switch to DiagLogLevel.DEBUG when troubleshooting):

diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.INFO);

And initializes the OpenTelemetry SDK and defines a ConsoleSpanExporter to output trace data to the console:

const sdk = new opentelemetry.NodeSDK({
  traceExporter: new opentelemetry.tracing.ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start()

Step 3: Monitor a Metric

Next, we’ll create an OpenTelemetry Meter to monitor metrics from the Node.js application. We’ll create a file named monitoring.js and add the following code.

The monitoring.js file imports the sdk-metrics-base library and sets up a metric exporter:

'use strict';

const { MeterProvider, ConsoleMetricExporter } = require('@opentelemetry/sdk-metrics-base');

Creates a MeterProvider which lets you create metrics of your choice:

const meter = new MeterProvider({
  exporter: new ConsoleMetricExporter(),
  interval: 1000,
}).getMeter('your-meter-name');

And adds a simple counter metric that counts incoming requests to the application:

const requestCount = meter.createCounter("requests", {
  description: "Count all incoming requests"
});

const boundInstruments = new Map();

module.exports.countAllRequests = () => {
  return (req, res, next) => {
    if (!boundInstruments.has(req.path)) {
      const labels = { route: req.path };
      const boundCounter = requestCount.bind(labels);
      boundInstruments.set(req.path, boundCounter);
    }

    boundInstruments.get(req.path).add(1);
    next();
  };
};

Here is how to import and use this code in the Node.js application. Update the top of app.js so that it reads as follows, leaving the rest of the file unchanged:

const express = require("express");
const { countAllRequests } = require("./monitoring");
const app = express();
app.use(countAllRequests());

Now, every time a user makes a request to the application, the meter will count the request.

Step 4: Run the Application

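Run the sample application with the tracing setup from Step 2 preloaded, so that it is initialized before the application code, using the command node --require ./trace.js app.js.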

Point your browser to the address http://localhost:8080, and you’ll see the metric displayed in the console by the ConsoleMetricExporter, as follows:

{
  "name": "requests",
  "description": "Count all incoming requests",
  "unit": "1",
  "metricKind": 0,
  "valueType": 1
}
{ "route": "/" }
"value": "1"

OpenTelemetry Best Practices

Use Attributes

In OpenTelemetry, attributes are key-value pairs that provide context for distributed tracing, metrics, logs or resources. Resources are a representation of the component that emits telemetry, like a process in a container. Attributes enable teams to capture additional data to find meaningful correlations, e.g., in the face of performance changes. Whether for root cause analysis or forward-looking performance optimization, attributes can help filter, search, visualize, and aggregate telemetry data.

Here are a few types of attributes you can use to improve observability:

  • User-specific attributes: provide context about the application user involved in a session or transaction.
  • Software-related attributes: provide information about the software involved in an activity.
  • Data-related attributes: provide context about the data used or transferred in a session or activity.
  • Infrastructure-related attributes: provide context about the infrastructure that was involved in an activity.

Use Attributes Consistently

If you are using attributes in an organization with multiple teams and codebases, it is very important to consistently adopt attributes. Without this standardization, troubleshooting issues across team and codebase boundaries becomes far more complex and confusing.
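
One way to encourage consistency is to publish attribute keys from a single shared module that every team imports. The following is a minimal sketch under that assumption; the key names (app.user.id and so on) and the buildAttributes helper are hypothetical.

```javascript
// attributes.js: the one place where attribute keys are defined (names are illustrative).
const AttributeKeys = Object.freeze({
  USER_ID: "app.user.id",
  TENANT_ID: "app.tenant.id",
  CART_SIZE: "app.cart.size",
});

const knownKeys = new Set(Object.values(AttributeKeys));

// Builds an attribute object and rejects keys that are not in the shared list,
// so a typo such as "app.userId" fails fast instead of creating a new attribute.
function buildAttributes(entries) {
  const attributes = {};
  for (const [key, value] of Object.entries(entries)) {
    if (!knownKeys.has(key)) throw new Error(`Unknown attribute key: ${key}`);
    attributes[key] = value;
  }
  return attributes;
}

const attributes = buildAttributes({
  [AttributeKeys.USER_ID]: "u-42",
  [AttributeKeys.CART_SIZE]: 3,
});
console.log(attributes);
```

The resulting object can then be passed wherever the instrumentation accepts attributes, and a query written by one team works on telemetry produced by another.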

Carefully Consider Cardinality

Cardinality is a measure of the number of dimensions in which telemetry data is likely to be recorded and queried. Attribute values and their indexing are the largest source of increased cardinality and, depending on the backend storing the data, it may require much more storage or slow down queries significantly.
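
The effect of a single unbounded attribute can be estimated with a short sketch that counts distinct attribute-value combinations. The data is synthetic and the attribute names (route, method, userId) are assumptions for the example:

```javascript
// Counts the distinct attribute-value combinations seen for one metric.
function countSeries(dataPoints, attributeKeys) {
  const combinations = new Set();
  for (const point of dataPoints) {
    combinations.add(JSON.stringify(attributeKeys.map((key) => point.attributes[key])));
  }
  return combinations.size;
}

// Synthetic data: 300 requests spread over 3 routes, 2 methods, and 300 users.
const routes = ["/", "/cart", "/checkout"];
const dataPoints = [];
for (let i = 0; i < 300; i++) {
  dataPoints.push({
    attributes: {
      route: routes[i % routes.length],
      method: i % 2 === 0 ? "GET" : "POST",
      userId: `user-${i}`,
    },
  });
}

const bounded = countSeries(dataPoints, ["route", "method"]);
const unbounded = countSeries(dataPoints, ["route", "method", "userId"]);
console.log(bounded, unbounded);
```

With only route and method, the number of combinations is capped at 3 × 2 = 6 regardless of traffic; adding a per-user identifier makes it grow with the number of users.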

Keep Initialization Separate from Instrumentation

One of the biggest benefits of OpenTelemetry is that it enables vendor-agnostic instrumentation. All telemetry calls made by your application are made through the vendor-independent OpenTelemetry API.

To keep this vendor independence, keep the provider configuration at the top level of your application or service (usually at the entry point). This decouples the provider configuration from instrumentation calls, allowing you to choose the tracing backend that best suits your use case without changing your instrumentation code. With this separation, you can easily switch providers using flags or environment variables.

In a continuous integration (CI) environment where you run integration tests, you may not want to run a tracing provider at all, to reduce cost and complexity. Similarly, in local development it might be enough to collect traces using an in-memory exporter, while in production it is necessary to use a hosted SaaS service for tracing. Separating provider initialization from instrumentation makes it easy to switch providers based on your environment.
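
The separation can be sketched as follows. This is a simplified stand-in rather than the OpenTelemetry API: the tracer interface (withSpan), the exporter factories, and the TRACE_EXPORTER environment variable are all hypothetical names chosen for the example.

```javascript
// --- Instrumentation: depends only on a tracer interface, never on a backend ---
function handleCheckout(tracer, cartSize) {
  return tracer.withSpan("checkout", { "app.cart.size": cartSize }, () => cartSize * 2);
}

// --- Initialization: the only place that knows about exporters ---
function createTracer(exporter) {
  return {
    withSpan(name, attributes, fn) {
      const span = { name, attributes, startMs: Date.now() };
      try {
        return fn();
      } finally {
        span.endMs = Date.now();
        exporter.export(span);
      }
    },
  };
}

const exporterFactories = new Map([
  ["none", () => ({ export() {} })],
  ["memory", () => ({ spans: [], export(span) { this.spans.push(span); } })],
  ["console", () => ({ export(span) { console.log(JSON.stringify(span)); } })],
]);

// TRACE_EXPORTER is a hypothetical environment variable for this sketch.
function initTracing(env = process.env) {
  const kind = env.TRACE_EXPORTER || "none";
  const factory = exporterFactories.get(kind);
  if (!factory) throw new Error(`Unsupported TRACE_EXPORTER: ${kind}`);
  const exporter = factory();
  return { tracer: createTracer(exporter), exporter };
}

// Entry point: pick the exporter once, then hand the tracer to the application code.
const { tracer, exporter } = initTracing({ TRACE_EXPORTER: "memory" });
const total = handleCheckout(tracer, 3);
console.log(total, exporter.spans.length);
```

Switching from the in-memory exporter to no exporter at all, for example in CI, changes only the value passed to initTracing; handleCheckout stays untouched.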

Unit Test Tracing Using Memory Span Exporters

In most cases, unit tests focus on program logic and ignore telemetry. This may lead to your telemetry being unusable when you most need it.

OpenTelemetry SDKs provide in-memory exporters, which let you query telemetry data collected during unit tests. The use of these constructs is not documented in most languages, so the best place to find examples of their use is in the OpenTelemetry unit tests for each project.
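
The pattern looks roughly like the sketch below. To keep it self-contained, it uses a hand-rolled MemorySpanExporter class instead of an SDK class, and the processOrder function and its span shape are hypothetical; with a real SDK, the in-memory exporter shipped by that SDK would take its place.

```javascript
const { strictEqual } = require("assert");

// Hand-rolled stand-in for an SDK in-memory exporter: it only collects finished spans.
class MemorySpanExporter {
  constructor() {
    this.finished = [];
  }
  export(span) {
    this.finished.push(span);
  }
  reset() {
    this.finished = [];
  }
}

// Code under test: records one span per processed order.
function processOrder(order, exporter) {
  const span = { name: "process-order", attributes: { "app.order.id": order.id }, status: "ok" };
  if (order.items.length === 0) span.status = "error";
  exporter.export(span);
  return span.status === "ok";
}

// The test checks the emitted telemetry as well as the return value.
function testEmptyOrderIsRecordedAsError() {
  const exporter = new MemorySpanExporter();
  const ok = processOrder({ id: "o-1", items: [] }, exporter);
  strictEqual(ok, false);
  strictEqual(exporter.finished.length, 1);
  strictEqual(exporter.finished[0].status, "error");
  strictEqual(exporter.finished[0].attributes["app.order.id"], "o-1");
}

testEmptyOrderIsRecordedAsError();
console.log("telemetry test passed");
```

If someone later renames the attribute or stops marking empty orders as errors, the test fails long before the missing telemetry is noticed during an incident.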

Microservices Monitoring with Lumigo

Using OpenTelemetry, Lumigo provides containerized applications with end-to-end observability through automated distributed tracing.