Distributed tracing tools provide visibility into requests as they progress through distributed systems and services, including the timings of each operation and related logs and errors.
You can use these tools to understand the interactions and relationships between microservices in a distributed environment, and to learn how each microservice performs and affects the others. Distributed tracing is also critical to achieving observability when deploying applications in a cloud native environment, for example on containerized or serverless infrastructure.
As organizations increasingly adopt technologies like containers, cloud, and serverless, and as applications continue to scale and grow more complex, observability is becoming a major challenge.
Additionally, while microservices can provide benefits to DevOps teams, a microservices architecture reduces system visibility, meaning that IT teams can miss the big picture across microservices, teams, and functions. Without proper guidance, IT teams have no effective way to identify problems and diagnose their root cause.
Distributed tracing provides a broad overview of application systems and pinpoints where errors are occurring in microservice communication. It tracks and logs all requests passing through services in a distributed environment. For example, distributed tracing allows system designers to see the performance of each function call, in order to pinpoint and troubleshoot the exact instance of a feature causing delays.
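To make the idea of tracking the performance of each call more concrete, here is a minimal sketch in Python. It is not the API of any tracing product; `RequestTrace`, `Operation`, and the service and operation names are hypothetical and exist only for illustration. It records how long each step of a request takes under one shared request identifier, then picks out the slowest step.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Operation:
    """One timed step of a request (hypothetical record, not a real tracer API)."""
    trace_id: str
    service: str
    name: str
    duration_ms: float = 0.0
    error: Optional[str] = None


@dataclass
class RequestTrace:
    """Collects every timed operation that shares one request identifier."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    operations: List[Operation] = field(default_factory=list)

    @contextmanager
    def timed(self, service: str, name: str):
        """Time the wrapped block and record any exception it raises."""
        op = Operation(self.trace_id, service, name)
        start = time.perf_counter()
        try:
            yield op
        except Exception as exc:
            op.error = repr(exc)
            raise
        finally:
            op.duration_ms = (time.perf_counter() - start) * 1000
            self.operations.append(op)

    def slowest(self) -> Operation:
        """Return the operation that took the longest."""
        return max(self.operations, key=lambda op: op.duration_ms)


# Usage: two hypothetical services handle parts of the same request.
trace = RequestTrace()
with trace.timed("checkout", "validate_cart"):
    time.sleep(0.01)
with trace.timed("payments", "charge_card"):
    time.sleep(0.05)

slow = trace.slowest()
print(f"slowest step: {slow.service}.{slow.name} ({slow.duration_ms:.1f} ms)")
```

A real tracer would also ship these records to a backend and propagate the identifier across process boundaries; this sketch keeps everything in one process to show only the timing and grouping.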
Distributed tracing uses two key concepts to provide visibility over cloud native and microservices environments:
Lumigo is a cloud native observability tool, purpose-built to navigate the complexities of microservices. Lumigo’s automated distributed tracing stitches together the many components of a containerized application and tracks every service in a request. When an error or failure occurs, you see not only the impacted service but the entire request in one visual map, so you can easily understand the root cause, limit the impact, and prevent future failures.
With deep debugging data into applications and infrastructure, developers have all the information they need to monitor and troubleshoot their containers without any of the manual work:
Get started with a free trial of Lumigo for your microservice applications
License: Apache License 2.0
GitHub: https://github.com/jaegertracing/jaeger
Jaeger is an open-source, end-to-end distributed tracing tool. Its design is based on Dapper, a distributed tracing system developed by Google, and its architecture is inspired by OpenZipkin. Its repository has over 16,000 GitHub stars and over 250 contributors.
Jaeger provides a web UI that can be used to visualize spans, and a backend that manages data collection and processing. Jaeger integrates with important tools in the ecosystem:
Jaeger has provided client libraries for several major programming languages, including Go, Node.js, Java, Python, C++, and C#. These clients have since been deprecated in favor of the OpenTelemetry SDKs.
License: Apache License 2.0
GitHub: https://github.com/prometheus/prometheus
Prometheus is an open source service that collects and stores metrics as time series data. It finds targets through service discovery or static configuration, retrieves data through the HTTP pull method, and stores it in a time series database.
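The pull model can be illustrated with a small sketch of the target side: an application exposes its current metric values over HTTP, and the Prometheus server fetches them on its own schedule. The example below uses only the Python standard library rather than an official client library. The metric name `http_requests_total`, the label values, and the helpers `render_counter` and `serve` are assumptions made for illustration; the `/metrics` path and the text exposition format follow the usual Prometheus conventions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            label_text = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical in-process counter values, keyed by label set.
REQUEST_COUNTS = {
    (("method", "get"),): 12,
    (("method", "post"),): 3,
}


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers scrapes on /metrics with the current counter values."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_counter(
            "http_requests_total", "Total HTTP requests.", REQUEST_COUNTS
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the endpoint; Prometheus would be configured to scrape it."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_counter("http_requests_total", "Total HTTP requests.", REQUEST_COUNTS))
```

The application never pushes anything: it only answers requests. That is why each target has to be discoverable, through service discovery or static configuration.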
Each Prometheus server node is autonomous and does not depend on distributed storage or other remote services, making it easier to manage and more reliable in containerized environments.
Prometheus records numeric time series data and supports cross-platform data collection and querying. It integrates with over 150 third-party systems, including Splunk, Kafka, Thanos, Gnocchi, and Wavefront.
Prometheus has several limitations:
License: Apache License 2.0
GitHub: https://github.com/opentracing
OpenTracing provides a set of distributed tracing standards and technologies that address three problems in traditional distributed tracing systems:
OpenTracing solves these problems by abstracting away the differences between distributed tracing implementations, which allows multiple tracers to coexist in one system without code changes. With this abstraction, developers can switch or add tracers without changing tools or refactoring applications.
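The value of such an abstraction can be shown with a short sketch. This is not the OpenTracing API: the `Tracer` protocol, the two tracer classes, `FanOutTracer`, and `handle_order` are hypothetical names chosen for illustration. The point is that application code depends only on an interface, so tracers can be swapped or combined without touching that code.

```python
import time
from typing import List, Protocol, Tuple


class Tracer(Protocol):
    """The only thing application code knows about (hypothetical interface)."""

    def record(self, operation: str, duration_ms: float) -> None: ...


class InMemoryTracer:
    """Keeps records in a list, e.g. for tests."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, float]] = []

    def record(self, operation: str, duration_ms: float) -> None:
        self.records.append((operation, duration_ms))


class PrintingTracer:
    """Writes records to standard output."""

    def record(self, operation: str, duration_ms: float) -> None:
        print(f"{operation} took {duration_ms:.3f} ms")


class FanOutTracer:
    """Forwards each record to several tracers so they can coexist."""

    def __init__(self, *tracers: Tracer) -> None:
        self.tracers = tracers

    def record(self, operation: str, duration_ms: float) -> None:
        for tracer in self.tracers:
            tracer.record(operation, duration_ms)


def handle_order(tracer: Tracer, order_id: str) -> str:
    """Application code: unchanged no matter which tracer is passed in."""
    start = time.perf_counter()
    result = f"order {order_id} accepted"
    tracer.record("handle_order", (time.perf_counter() - start) * 1000)
    return result


# Usage: swap or combine tracers at the call site only.
memory = InMemoryTracer()
handle_order(FanOutTracer(memory, PrintingTracer()), "42")
```

Switching backends then becomes a change to the line that constructs the tracer, not to every function that reports timing data.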
License: Apache License 2.0
GitHub: https://github.com/openzipkin/zipkin
Zipkin is an open source project that enables IT teams to send, receive, store, and visualize traces within and across services. Like Jaeger, it is based on Google’s Dapper tool, and it gathers the timing data needed to troubleshoot latency problems in distributed systems. The system is implemented in Java and has an OpenTracing-compliant API.
Zipkin’s architecture consists of:
A trace ID is attached to each request to identify the request across services. Zipkin also compares traces to identify services or tasks that are running longer than others.
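One common way to carry that identifier between services in the Zipkin ecosystem is the B3 propagation headers (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`). The sketch below shows the idea with plain dictionaries standing in for HTTP headers. The helper names `new_id`, `inject`, and `continue_trace` are hypothetical, and a real instrumentation library would also handle sampling flags and case-insensitive header lookup.

```python
import secrets
from typing import Dict, Optional


def new_id() -> str:
    """Return a random 64-bit identifier as 16 lower-hex characters."""
    return secrets.token_hex(8)


def inject(headers: Dict[str, str], trace_id: str, span_id: str,
           parent_span_id: Optional[str] = None) -> Dict[str, str]:
    """Attach the trace context to an outgoing request's headers."""
    headers["X-B3-TraceId"] = trace_id
    headers["X-B3-SpanId"] = span_id
    if parent_span_id:
        headers["X-B3-ParentSpanId"] = parent_span_id
    return headers


def continue_trace(incoming: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Start a span, reusing the caller's trace ID when one is present."""
    trace_id = incoming.get("X-B3-TraceId") or new_id()
    parent_span_id = incoming.get("X-B3-SpanId")
    return {
        "trace_id": trace_id,
        "span_id": new_id(),
        "parent_span_id": parent_span_id,
    }


# Usage: service A starts a trace, then calls service B.
span_a = continue_trace({})  # no incoming context, so a new trace begins
outgoing = inject({}, span_a["trace_id"], span_a["span_id"])
span_b = continue_trace(outgoing)  # service B joins the same trace
print(span_a["trace_id"] == span_b["trace_id"])
```

Because both services report spans with the same trace ID, the backend can reassemble them into a single end-to-end view of the request.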
Zipkin’s built-in UI is a self-contained web application. It provides a dependency graph showing how many traced requests passed through each application, which helps you investigate problems.
To report trace data to Zipkin, IT admins need to instrument their applications to send spans over a supported transport: HTTP, Apache Kafka, Apache ActiveMQ, or gRPC. For large-scale back-end storage, Zipkin supports Cassandra and Elasticsearch.
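As a rough illustration of the HTTP transport, the sketch below builds a span in Zipkin's v2 JSON format and prepares a POST request for the collector. It assumes a Zipkin server at its default local address (`http://localhost:9411/api/v2/spans`). The helpers `build_span` and `build_request` and the service and span names are hypothetical. In practice an instrumentation library would normally do this for you, and the request is only constructed here, not sent.

```python
import json
import time
import urllib.request

# Assumption: a Zipkin server running locally on its default port.
ZIPKIN_URL = "http://localhost:9411/api/v2/spans"


def build_span(trace_id, span_id, name, service, start_us, duration_us,
               parent_id=None):
    """Build one span as a dict in the Zipkin v2 JSON shape.

    Timestamps and durations are expressed in microseconds.
    """
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": start_us,
        "duration": duration_us,
        "localEndpoint": {"serviceName": service},
    }
    if parent_id:
        span["parentId"] = parent_id
    return span


def build_request(spans, url=ZIPKIN_URL):
    """Prepare (but do not send) an HTTP POST carrying a list of spans."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage: describe one 25 ms operation in a hypothetical "checkout" service.
now_us = int(time.time() * 1_000_000)
span = build_span("463ac35c9f6413ad", "a2fb4a1d1a96d312",
                  "get /cart", "checkout", now_us, 25_000)
request = build_request([span])
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=5)
```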