Distributed Tracing Tools: The Basics and 5 Tools You Should Know

What are Distributed Tracing Tools?

Distributed tracing tools provide visibility into requests as they progress through distributed systems and services, including the timings of each operation and related logs and errors.

You can use these tools to understand the interactions and relationships between microservices in a distributed environment, and to learn how each microservice performs and affects the others. Distributed tracing is also critical for achieving observability when deploying applications in a cloud native environment, for example on containerized or serverless infrastructure.

How Distributed Tracing Works

As organizations increasingly adopt technologies like containers, cloud, and serverless, and as applications continue to scale and grow more complex, observability is becoming a major challenge.

Additionally, while microservices can provide benefits to DevOps teams, a microservices architecture reduces system visibility, so IT teams can miss the big picture across microservices, teams, and functions. Without end-to-end visibility, they have no effective way to identify problems and diagnose their root cause.

Distributed tracing provides a broad overview of application systems and pinpoints where errors occur in the communication between microservices. It tracks and records every request that passes through the services in a distributed environment. For example, distributed tracing lets system designers see the performance of each function call, so they can pinpoint and troubleshoot the exact operation that is causing delays.

Distributed tracing uses two key concepts to provide visibility over cloud native and microservices environments:

Traces—a complete request process, which is divided into spans.
Spans—an activity within a marked time interval, occurring within a single component or system service. By evaluating each span within a trace, an IT manager can determine the cause of a problem.
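
To make the relationship between traces and spans concrete, here is a minimal Python sketch. The Span and Trace classes, the service names, and the timings are all hypothetical, and the self-time calculation assumes child spans run one after another rather than in parallel. It shows how evaluating each span in a trace can point to the operation that is responsible for most of the latency.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One timed operation inside a single service (hypothetical model)."""
    span_id: str
    service: str
    operation: str
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None  # None marks the root span

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    """All spans recorded for one request, tied together by a trace ID."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    def self_time_ms(self, span: Span) -> int:
        # Time spent in the span itself, excluding its direct children.
        # Assumes children do not overlap in time.
        children = [s for s in self.spans if s.parent_id == span.span_id]
        return span.duration_ms - sum(c.duration_ms for c in children)

    def slowest_by_self_time(self) -> Span:
        return max(self.spans, key=self.self_time_ms)


# Made-up data for one request that crosses four components.
trace = Trace("trace-1", [
    Span("a", "api-gateway", "GET /checkout", 0, 480),
    Span("b", "cart-service", "load_cart", 10, 60, parent_id="a"),
    Span("c", "payment-service", "charge_card", 70, 460, parent_id="a"),
    Span("d", "payment-db", "INSERT payment", 80, 120, parent_id="c"),
])

culprit = trace.slowest_by_self_time()
print(culprit.service, culprit.operation, trace.self_time_ms(culprit))
# prints: payment-service charge_card 350
```

The root span takes 480 ms in total, but almost all of that time is spent inside the payment service, which is the kind of conclusion a tracing UI lets you reach visually.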

5 Distributed Tracing Tools You Should Know

1. Lumigo

Lumigo is a cloud native observability tool, purpose-built to navigate the complexities of microservices. Lumigo’s automated distributed tracing stitches together the many components of a containerized application and tracks every service in a request. When an error or failure occurs, you see not only the impacted service but the entire request in one visual map, so you can easily understand the root cause, limit the impact, and prevent future failures.

With deep debugging data into applications and infrastructure, developers have all the information they need to monitor and troubleshoot their containers without any of the manual work:

Automatic correlation of logs, metrics, and traces into an end-to-end visualization of requests and a full system map of applications
Monitor and debug third-party APIs and managed services (e.g., Amazon DynamoDB, Twilio, Stripe)
Go from alert (in Slack, PagerDuty and other workflow tools) to root cause analysis in one click
Understand system behavior and explore performance and cost issues

Get started with a free trial of Lumigo for your microservice applications

2. Jaeger

License: Apache License 2.0

GitHub: https://github.com/jaegertracing/jaeger

Jaeger is an open-source, end-to-end distributed tracing tool inspired by Dapper, a distributed tracing system developed by Google, and by OpenZipkin. Its repository has over 16,000 GitHub stars and over 250 contributors.

Jaeger provides a web UI that can be used to visualize spans, and a backend that manages data collection and processing. Jaeger integrates with important tools in the ecosystem:

It natively supports the OpenTracing standard, and its storage layer is pluggable, so you can store distributed trace records in a database of your choice.
It supports Cassandra and Elasticsearch out of the box, and its community has extended support to other databases such as InfluxDB and DynamoDB.
The backend component exposes Prometheus metrics by default, making it easier to integrate with Prometheus.

Jaeger provides client libraries for several major programming languages, including Go, Node.js, Java, Python, C++, and C#. The Jaeger project has deprecated these clients in favor of the OpenTelemetry SDKs.

3. Prometheus

License: Apache License 2.0

GitHub: https://github.com/prometheus/prometheus

Prometheus is an open source service that collects and stores metrics as time series data. It finds targets through service discovery or static configuration, retrieves data through the HTTP pull method, and stores it in a time series database.
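
As a rough illustration of the pull model, the sketch below renders a set of counters in the Prometheus text exposition format, which is what a target returns when Prometheus scrapes it over HTTP. The inc and render helpers and the metric name are hypothetical, the sketch handles counters only, and it does not escape special characters in label values. In practice you would normally use an official Prometheus client library rather than formatting this by hand.

```python
from typing import Dict, Tuple

# Hypothetical in-process store: (metric name, sorted label pairs) -> value.
LabelPairs = Tuple[Tuple[str, str], ...]
Counters = Dict[Tuple[str, LabelPairs], float]


def inc(counters: Counters, name: str, labels: Dict[str, str],
        amount: float = 1.0) -> None:
    """Increment one counter series, creating it on first use."""
    key = (name, tuple(sorted(labels.items())))
    counters[key] = counters.get(key, 0.0) + amount


def render(counters: Counters, help_text: Dict[str, str]) -> str:
    """Render all counters as text that a scrape endpoint could return."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(counters.items()):
        if name not in seen:
            # HELP and TYPE metadata appear once per metric name.
            seen.add(name)
            lines.append(f"# HELP {name} {help_text.get(name, '')}")
            lines.append(f"# TYPE {name} counter")
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


counters: Counters = {}
inc(counters, "http_requests_total", {"method": "GET", "code": "200"})
inc(counters, "http_requests_total", {"method": "GET", "code": "200"})
inc(counters, "http_requests_total", {"method": "POST", "code": "500"})

body = render(counters, {"http_requests_total": "Total HTTP requests handled."})
print(body)
```

Each distinct combination of metric name and labels becomes its own time series once Prometheus stores the scraped samples.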

Each Prometheus server node is autonomous and does not depend on distributed storage or other remote services, making it easier to manage and more reliable in containerized environments.

Prometheus records numeric time series data and supports cross-platform data collection and querying. It integrates with over 150 third-party systems, including Splunk, Kafka, Thanos, Gnocchi, and Wavefront.

Prometheus has several limitations:

The data it collects is unlikely to be detailed and complete enough for use cases that require 100% accuracy, such as per-request billing.
Local storage has a default retention period of 15 days, so long-term storage requires an external platform or service.
Prometheus is not a full distributed tracing solution, and needs to be combined with other tools.

4. OpenTracing

License: Apache License 2.0

GitHub: https://github.com/opentracing

OpenTracing provides a set of distributed tracing standards and technologies that address three problems in traditional distributed tracing systems:

The need to use vendor-specific APIs and tools, which creates lock-in.
The need to switch between different tools to accurately track requests across different frameworks and layers.
Tight coupling with the underlying tracing platform, requiring programmers to refactor their code when switching tracing systems.

OpenTracing solves these problems by abstracting the differences between distributed tracing deployments, allowing multiple tracers to coexist in one system without code changes. This abstraction allows developers to easily switch or add tracers without changing tools or refactoring applications.
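
The idea behind this abstraction can be sketched in a few lines of Python. Everything below is hypothetical: the Tracer interface and the three tracer classes are made up for illustration and are not the real OpenTracing API. The point is only that application code depends on one neutral interface, so tracers can be swapped or combined without touching that code.

```python
import time
from contextlib import contextmanager
from typing import Iterator, List, Protocol


class Tracer(Protocol):
    """Hypothetical vendor-neutral interface (not the real OpenTracing API)."""

    def start_span(self, operation: str) -> None: ...
    def finish_span(self, operation: str, elapsed_s: float) -> None: ...


class InMemoryTracer:
    """Keeps finished span names in a list, for example for tests."""

    def __init__(self) -> None:
        self.finished: List[str] = []

    def start_span(self, operation: str) -> None:
        pass

    def finish_span(self, operation: str, elapsed_s: float) -> None:
        self.finished.append(operation)


class PrintTracer:
    """Stands in for a second tracing backend by printing each span."""

    def start_span(self, operation: str) -> None:
        print(f"start  {operation}")

    def finish_span(self, operation: str, elapsed_s: float) -> None:
        print(f"finish {operation} ({elapsed_s * 1000:.2f} ms)")


class CompositeTracer:
    """Fans every call out to several tracers so they can coexist."""

    def __init__(self, tracers: List[Tracer]) -> None:
        self.tracers = tracers

    def start_span(self, operation: str) -> None:
        for tracer in self.tracers:
            tracer.start_span(operation)

    def finish_span(self, operation: str, elapsed_s: float) -> None:
        for tracer in self.tracers:
            tracer.finish_span(operation, elapsed_s)


@contextmanager
def traced(tracer: Tracer, operation: str) -> Iterator[None]:
    """Time a block of code and report it to whichever tracer was passed in."""
    tracer.start_span(operation)
    started = time.perf_counter()
    try:
        yield
    finally:
        tracer.finish_span(operation, time.perf_counter() - started)


def handle_checkout(tracer: Tracer) -> str:
    # Application code only knows about the Tracer interface.
    with traced(tracer, "handle_checkout"):
        with traced(tracer, "charge_card"):
            pass
    return "ok"


memory = InMemoryTracer()
result = handle_checkout(CompositeTracer([memory, PrintTracer()]))
```

Switching backends here means passing a different object to handle_checkout; the function body itself stays the same.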

5. Zipkin

License: Apache License 2.0

GitHub: https://github.com/openzipkin/zipkin

Zipkin is an open source project that enables IT teams to send, receive, store, and visualize traces within and across services. Like Jaeger, it is based on Google’s Dapper, and it gathers the timing data needed to troubleshoot latency problems in distributed systems. The system is implemented in Java and has an OpenTracing-compliant API.

Zipkin’s architecture consists of:

A collector that validates, stores, and indexes incoming trace data.
A reporter that sends span and trace data from the tracer library to Zipkin.
A web UI for viewing traces.
An API for querying and extracting traces.

A trace ID is attached to each request to identify it across services. Zipkin also compares traces to identify services or tasks that are running longer than others.
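
In Zipkin-instrumented systems, this ID usually travels between services in B3 HTTP headers such as X-B3-TraceId, X-B3-SpanId, and X-B3-ParentSpanId. The sketch below shows the general idea under some simplifying assumptions: the helper names are hypothetical, header lookup is case-sensitive (real HTTP headers are not), and the X-B3-Sampled header is left out. A real tracer library handles all of this for you.

```python
import secrets
from typing import Dict, Mapping


def new_id() -> str:
    """Return a random 64-bit ID as 16 lower-hex characters."""
    return secrets.token_hex(8)


def extract_or_start(headers: Mapping[str, str]) -> Dict[str, str]:
    """Continue the caller's trace if B3 headers are present, else start one."""
    context = {
        "trace_id": headers.get("X-B3-TraceId") or new_id(),
        "span_id": new_id(),  # every service creates its own span
    }
    caller_span = headers.get("X-B3-SpanId")
    if caller_span:
        context["parent_id"] = caller_span
    return context


def inject(context: Mapping[str, str]) -> Dict[str, str]:
    """Build the headers to attach to an outgoing request."""
    headers = {
        "X-B3-TraceId": context["trace_id"],
        "X-B3-SpanId": context["span_id"],
    }
    if "parent_id" in context:
        headers["X-B3-ParentSpanId"] = context["parent_id"]
    return headers


# The edge service receives a request without tracing headers.
edge = extract_or_start({})
# A downstream service receives the headers that the edge service injected.
downstream = extract_or_start(inject(edge))

print(downstream["trace_id"] == edge["trace_id"])   # prints: True
print(downstream["parent_id"] == edge["span_id"])   # prints: True
```

Because the trace ID stays the same while each hop gets a new span ID, the backend can later reassemble all spans of a request into one tree.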

Zipkin’s built-in UI is a self-contained web application. It provides a dependency graph that shows the number of traced requests that passed through each application, to help you investigate problems.

To report trace data to Zipkin, IT admins need to instrument their applications, which then send the data over a transport such as HTTP, Apache Kafka, Apache ActiveMQ, or gRPC. Zipkin supports Cassandra and Elasticsearch for large-scale back-end storage.
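
For the HTTP transport, reporting comes down to posting a JSON list of spans to the collector. The following sketch builds such a payload in Zipkin's v2 JSON span format using only the Python standard library. The build_span, encode, and report helpers, the service name, and the IDs are hypothetical, and the URL assumes a Zipkin server running locally on its default port. The report function is defined but not called here, since it needs a running server.

```python
import json
import urllib.request
from typing import Dict, List, Optional

# Assumption: a local Zipkin instance listening on its default port.
ZIPKIN_URL = "http://localhost:9411/api/v2/spans"


def build_span(trace_id: str, span_id: str, name: str, service: str,
               start_us: int, duration_us: int,
               parent_id: Optional[str] = None) -> Dict[str, object]:
    """Build one span as a dict in Zipkin's v2 JSON layout."""
    span: Dict[str, object] = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": start_us,    # start time in epoch microseconds
        "duration": duration_us,  # duration in microseconds
        "localEndpoint": {"serviceName": service},
    }
    if parent_id:
        span["parentId"] = parent_id
    return span


def encode(spans: List[Dict[str, object]]) -> bytes:
    """Serialize a list of spans into the request body."""
    return json.dumps(spans).encode("utf-8")


def report(spans: List[Dict[str, object]], url: str = ZIPKIN_URL) -> int:
    """POST the spans to Zipkin and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=encode(spans),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=2) as response:
        return response.status


# Made-up IDs and timings for a single child span.
span = build_span(
    trace_id="463ac35c9f6413ad",
    span_id="72485a3953bb6124",
    name="charge_card",
    service="payment-service",
    start_us=1_700_000_000_000_000,
    duration_us=390_000,
    parent_id="a2fb4a1d1a96d312",
)
payload = encode([span])
print(payload.decode("utf-8"))
```

In production, reporters typically batch spans and send them asynchronously so that tracing does not add latency to the requests being traced.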