Distributed Tracing in Microservices: Basics & 4 Tools to Know

What Is Distributed Tracing?

Distributed tracing enables you to gain visibility into microservices-based applications. DevOps teams and other IT personnel use distributed tracing to track a transaction or request as it travels across the monitored application. It helps locate issues affecting the application’s performance, such as bugs and bottlenecks.

The Microservices Observability Challenge

Cloud native applications usually employ a microservices architecture that breaks a large application into small services that use APIs to communicate with each other. A microservices architecture enables you to host different services on multiple servers and across different geographic regions.

Splitting an application into small services makes it more manageable and scalable, significantly increasing its flexibility and resiliency while reducing redundancy across the architecture. However, a microservices architecture poses tracing challenges that make it difficult to spot issues.

A monolithic app consists of one module, typically managed by one team, which makes it easier to trace requests and events and fix errors. In a microservices architecture, a request passes through several services before a response is generated. If different teams manage and monitor each service, tracking activity across all services and identifying a problematic area becomes difficult.

It is also difficult to set up external monitoring for microservices-based applications due to the complex interlinking of services needed to create a response for each request. As a result, external monitoring can trace only the total response time and number of invocations. Distributed tracing helps solve these challenges.

How Distributed Tracing Works

Distributed tracing is a technique that enables you to gain visibility into microservices-based applications. Here is how this process works:

Traces

The process starts with a single request. Each request is treated as a trace and receives a unique ID (trace ID) that identifies that specific transaction. A trace is made up of a series of tagged time intervals called spans.

Spans

A span represents the work performed within the distributed system. Each span gets a name, a timestamp, optional metadata, and a unique ID (span ID). Spans have a parent-child relationship that shows the exact path each transaction takes through the application’s components.
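To make the fields described above concrete, here is a minimal sketch of a span as a plain data structure. The `Span` class, the `new_id` helper, and the ID lengths are assumptions made for illustration; they are not the data model of any particular tracing tool.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


def new_id(num_bytes: int = 8) -> str:
    """Return a random hex identifier (hypothetical ID scheme)."""
    return secrets.token_hex(num_bytes)


@dataclass
class Span:
    name: str                                   # what work this span covers
    trace_id: str                               # shared by every span in the trace
    span_id: str = field(default_factory=new_id)
    parent_id: Optional[str] = None             # None marks the root span
    start_time: float = field(default_factory=time.time)
    metadata: dict = field(default_factory=dict)

    def child(self, name: str, **metadata) -> "Span":
        """Create a child span that stays in the same trace."""
        return Span(
            name=name,
            trace_id=self.trace_id,
            parent_id=self.span_id,
            metadata=metadata,
        )


# The root span opens the trace; the child records a downstream call.
root = Span(name="GET /checkout", trace_id=new_id(16))
db_call = root.child("SELECT orders", db="orders")
```

The child keeps the parent's trace ID and stores the parent's span ID, which is what allows the path of a transaction to be reconstructed later.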

Distributed tracing

A span records the activity that occurs as a request moves between services. When one activity hands off to the next, the parent span is linked to the child span that records the following activity. A single distributed trace is the result of combining all these spans in the right order, providing an overview of the entire request. After a trace runs its course, you can look it up in the presentation layer of a distributed tracing tool.
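Combining spans "in the right order" can be sketched as grouping spans by their parent and walking the resulting tree. The span records, field names, and the `assemble_trace` function below are hypothetical; real tracing back ends do this on much larger volumes and with more robust handling of missing or late spans.

```python
from collections import defaultdict

# Hypothetical span records, arriving out of order from different services.
spans = [
    {"span_id": "c", "parent_id": "b", "name": "db.query", "start": 3},
    {"span_id": "a", "parent_id": None, "name": "gateway", "start": 1},
    {"span_id": "b", "parent_id": "a", "name": "orders", "start": 2},
    {"span_id": "d", "parent_id": "a", "name": "billing", "start": 4},
]


def assemble_trace(spans):
    """Return the span names as an indented tree, ordered by start time."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit siblings in start-time order, then descend into each one.
        for span in sorted(children[parent_id], key=lambda s: s["start"]):
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # the root span has no parent
    return lines


trace_view = assemble_trace(spans)
print("\n".join(trace_view))
```

The printed tree is a text-only stand-in for the waterfall view that tracing tools typically render.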

Why Is Distributed Tracing Essential for Microservices Monitoring?

To effectively monitor a microservices-based application, you need to understand how various components interact to process different user requests. It can be difficult to achieve this visibility without a centralized view of the system’s overall performance. Troubleshooting a microservices-based application requires learning what happens to each request across all touchpoints.

Distributed tracing provides the information needed to determine which services a user request went to, how long each step took to process, how services are connected, and the point at which a failed request broke down. You can also aggregate tracing data to derive macro-level metrics, such as the error rate and 99th percentile latency of certain components.
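As a rough illustration of that aggregation step, the sketch below computes an error rate and a 99th percentile latency from a list of finished spans. The record layout, the sample data, and the nearest-rank percentile method are assumptions chosen for brevity; tracing tools may use different data models and estimation methods.

```python
import math

# Hypothetical finished spans for one component: (service, duration_ms, is_error)
records = [("checkout", d, False) for d in range(1, 100)] + [("checkout", 900, True)]


def error_rate(records):
    """Fraction of spans that ended in an error."""
    return sum(1 for _, _, is_error in records if is_error) / len(records)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


durations = [duration for _, duration, _ in records]
print("error rate:", error_rate(records))
print("p99 latency (ms):", percentile(durations, 99))
```

A high percentile is often more informative than the mean here, because a single slow outlier (the 900 ms span in this sample) skews an average while leaving most percentiles unchanged.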

4 Distributed Tracing Tools for Microservices

Lumigo

Lumigo is a cloud native observability tool, purpose-built to navigate the complexities of microservices. Through automated distributed tracing, Lumigo stitches together the many components of a containerized application and tracks every service in a request. When an error or failure occurs, you see not only the impacted service but the entire request in one visual map, so you can easily understand the root cause, limit the impact, and prevent future failures.

With deep debugging data on applications and infrastructure, developers have all the information they need to monitor and troubleshoot their containers without any of the manual work:

Automatic correlation of logs, metrics and traces into end-to-end visualization of requests and full system map of applications
Monitor and debug third-party APIs and managed services (e.g., Amazon DynamoDB, Twilio, Stripe)
Go from alert (in Slack, PagerDuty and other workflow tools) to root cause analysis in one click
Understand system behavior and explore performance and cost issues

OpenTracing

In the past, code instrumentation was tightly coupled to the underlying tracing platform, so switching between tracing systems meant refactoring code. Re-instrumenting code across various layers and frameworks each time significantly slowed the tracing effort.

OpenTracing provides vendor-agnostic techniques and standards for distributed tracing. It uses vendor-neutral APIs and instrumentations to trace transactions, abstracting the differences between distributed deployments of tracers to ensure they can coexist in one system. This abstraction allows developers to switch tracer instances without constantly changing instrumentation.
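The idea of instrumenting against a neutral interface rather than a specific tracer can be sketched as follows. Note that this is not the OpenTracing API itself: the `Tracer` protocol, both tracer classes, and the `traced` helper are hypothetical names invented to illustrate the abstraction.

```python
from contextlib import contextmanager
from typing import Iterator, Protocol


class Tracer(Protocol):
    """Hypothetical vendor-neutral interface that application code depends on."""

    def start_span(self, name: str) -> None: ...
    def finish_span(self, name: str) -> None: ...


class ConsoleTracer:
    """One interchangeable back end: prints span events."""

    def start_span(self, name: str) -> None:
        print(f"start {name}")

    def finish_span(self, name: str) -> None:
        print(f"finish {name}")


class RecordingTracer:
    """Another back end: keeps span events in memory."""

    def __init__(self) -> None:
        self.events = []

    def start_span(self, name: str) -> None:
        self.events.append(("start", name))

    def finish_span(self, name: str) -> None:
        self.events.append(("finish", name))


@contextmanager
def traced(tracer: Tracer, name: str) -> Iterator[None]:
    """Wrap a block of work in a span, whichever tracer is supplied."""
    tracer.start_span(name)
    try:
        yield
    finally:
        tracer.finish_span(name)


def handle_request(tracer: Tracer) -> str:
    # The instrumentation below never names a concrete tracer.
    with traced(tracer, "handle_request"):
        return "ok"


handle_request(ConsoleTracer())
recorder = RecordingTracer()
handle_request(recorder)
```

Swapping `ConsoleTracer` for `RecordingTracer` requires no change to `handle_request`, which is the property the paragraph above describes.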

Zipkin

Zipkin is an open source distributed tracing tool initially developed by Twitter. It visualizes trace data within and between services. Zipkin’s Java-based architecture includes a collector, search service, storage service, and a web-based user interface (UI). The collector is responsible for validating incoming data and passing it to a storage service like MySQL, Cassandra, or Elasticsearch.

Zipkin allows users to query and retrieve traces from a database using the search service API and the web UI. You can define the context propagation features for a certain trace to get a holistic view of an entire service to pinpoint slowdowns and debug issues. It also helps you effectively perform forensics without reassembling application flows from log data.
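Context propagation in the Zipkin ecosystem is commonly done with B3 HTTP headers (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`). The article does not name B3, so treat the following as a sketch under that assumption. The `inject` and `extract_and_continue` helpers are hypothetical and are not part of any Zipkin client library; real HTTP header lookups should also be case-insensitive.

```python
import secrets


def inject(trace_id, span_id, parent_span_id=None, sampled=True):
    """Build B3 headers for an outgoing request (caller side)."""
    headers = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": "1" if sampled else "0",
    }
    if parent_span_id:
        headers["X-B3-ParentSpanId"] = parent_span_id
    return headers


def extract_and_continue(headers):
    """Continue the trace in the receiving service.

    This sketch starts a new child span whose parent is the caller's span.
    """
    return {
        "trace_id": headers["X-B3-TraceId"],
        "parent_span_id": headers["X-B3-SpanId"],
        "span_id": secrets.token_hex(8),  # 64-bit span ID as 16 hex characters
        "sampled": headers.get("X-B3-Sampled") == "1",
    }


# The caller starts a trace and attaches the headers to its outgoing request.
outgoing = inject(trace_id=secrets.token_hex(16), span_id=secrets.token_hex(8))
# The callee reads the headers and keeps the same trace ID.
downstream = extract_and_continue(outgoing)
```

Because every service forwards the same trace ID, the spans it reports can later be joined into one trace.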

Jaeger

Both Jaeger and Zipkin are compatible with OpenTracing, but Jaeger’s architecture is more focused on parallelism and scalability. Jaeger’s back end was built in Go, using various components, including a collector, query API, UI, and datastore. It can accept Zipkin span requests, making it easy to switch from Zipkin to Jaeger.

Jaeger agents are responsible for listening to incoming requests and routing them to a collector that handles validation, transformation, and persistent storage. Next, the query service exposes a REST API to access tracing data from the storage service for analysis using a React-based UI.
