Distributed Tracing

What is Distributed Tracing?

As Marc Andreessen put it, “Software is eating the world.”

The resulting competition in every sector has set a clear imperative for software engineering teams: release working software to end users as quickly as possible. That pressure for higher development velocity has forced engineers to find ways to move faster.

Agile and DevOps are among the many methodologies that emerged in response. One of the most significant shifts was the move from monolithic applications to microservices-based applications. The idea is that by reducing the dependencies and coupling between teams, organizations can move faster and with fewer barriers.

Furthermore, because microservices communicate through well-defined interfaces (for example, REST over HTTP with JSON payloads), each team can build its microservice in whichever runtime environment or programming language best suits the task.

The challenge of monitoring microservices

Nothing comes without its challenges, though, and one of the main issues arising from this architecture is the increased difficulty of monitoring and debugging.

In a monolithic application, all parts of the code run together, making it relatively simple to isolate a transaction and, when things don’t work as expected, to look at the application stack trace and understand exactly what happened and what went wrong.

In a microservices world, the boundaries are more blurred, and with serverless computing (often described as the next stage of this evolution, and sometimes called nanoservices) the problem becomes even more acute. With so many moving parts, many of them operating asynchronously, it is hard to monitor, debug and troubleshoot the application.

Microservice applications are significantly more complex than traditional 3-tier apps.

A new solution was required. In April 2010, Google released a paper on distributed tracing, “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure”, describing the problem and ways to address it. Many papers soon followed and extended the ideas presented. The main concept of distributed tracing is that each software component reports what it is doing and the context of that activity. These pieces of metadata are sent to a central repository, which assembles a full picture from the stream of data.
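To make the idea concrete, here is a minimal sketch, assuming two hypothetical services (`checkout_service` and `payment_service`), a hypothetical `x-trace-id` header that carries the context, and an in-memory list standing in for the central repository. None of these names come from Dapper or from any real tracing library.

```python
import time
import uuid

# Hypothetical in-memory stand-in for the central repository.
CENTRAL_REPOSITORY = []

def report(service, operation, context):
    """Record what a component did, together with the request context."""
    CENTRAL_REPOSITORY.append({
        "trace_id": context["trace_id"],
        "service": service,
        "operation": operation,
        "timestamp": time.time(),
    })

def checkout_service(headers):
    # Reuse the caller's trace ID if present, otherwise start a new trace.
    context = {"trace_id": headers.get("x-trace-id") or uuid.uuid4().hex}
    report("checkout", "create_order", context)
    # Forward the same context on the outgoing call.
    payment_service({"x-trace-id": context["trace_id"]})
    return context["trace_id"]

def payment_service(headers):
    context = {"trace_id": headers.get("x-trace-id") or uuid.uuid4().hex}
    report("payment", "charge_card", context)

def full_picture(trace_id):
    """Gather every record for one request, in the order it was reported."""
    records = [r for r in CENTRAL_REPOSITORY if r["trace_id"] == trace_id]
    return [(r["service"], r["operation"]) for r in records]

trace_id = checkout_service({})
print(full_picture(trace_id))
```

Because both services attach the same trace ID to what they report, the repository can later pull together everything that happened for a single request, even though the work was spread across components.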

Introducing the Span

The OpenTracing Project, hosted by the Cloud Native Computing Foundation, introduced the term span, defining it as:

“The primary building block of a distributed trace, representing an individual unit of work done in a distributed system. Each component of the distributed system contributes a span – a named, timed operation representing a piece of the workflow.”

Within a transaction, there is a root span (the first span of the transaction, which has no parent) and there are child spans (the spans that follow it, each pointing back to its parent).
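The parent and child relationship can be sketched as a small data structure. This is an illustrative model only: the `Span` class and its field names below are assumptions made for the example, not the OpenTracing API.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str] = None  # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def child(self, name):
        """Start a child span in the same trace, pointing back at this span."""
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        """Mark the operation as complete by recording its end time."""
        self.end = time.time()

    @property
    def is_root(self):
        return self.parent_id is None

# One transaction: a root span with two child spans.
root = Span(name="GET /checkout", trace_id=uuid.uuid4().hex)
db = root.child("query inventory")
pay = root.child("charge card")
for span in (db, pay, root):
    span.finish()

print(root.is_root, db.is_root, db.parent_id == root.span_id)
```

Every span carries the trace ID of the transaction it belongs to, and every span except the root carries the ID of its parent. Those two fields are what allow the spans to be stitched back together later.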

The next ingredient we need is a tracer – an engine that collects all the spans that are sent and turns them into a meaningful transaction story. Many tools help gather this information. Two of the most widely used are Zipkin and Jaeger, both of which handle tracing for microservices well.

Developers need to make sure that all components send out spans. In most cases, this requires adding some instrumentation to the code.
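As a rough illustration of how collected spans become a transaction story, the sketch below rebuilds the parent and child tree from spans that arrive out of order. The span dictionaries, their field names and the millisecond timestamps are made up for the example; this is not how Zipkin or Jaeger are implemented.

```python
from collections import defaultdict

def build_trace(spans):
    """Group spans by parent and return indented lines, root first."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)
    # Show siblings in the order they started.
    for siblings in children.values():
        siblings.sort(key=lambda s: s["start"])

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            duration = span["end"] - span["start"]
            lines.append(f"{'  ' * depth}{span['name']} ({duration} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # the root span is the one without a parent
    return lines

# Spans as they might arrive, out of order, from different services.
received = [
    {"span_id": "c", "parent_id": "a", "name": "payment: charge card", "start": 40, "end": 90},
    {"span_id": "a", "parent_id": None, "name": "checkout: POST /orders", "start": 0, "end": 100},
    {"span_id": "d", "parent_id": "c", "name": "bank-gateway: authorize", "start": 45, "end": 85},
    {"span_id": "b", "parent_id": "a", "name": "inventory: reserve items", "start": 5, "end": 30},
]

for line in build_trace(received):
    print(line)
```

The indentation in the output mirrors the call hierarchy, which is the kind of view a tracing tool typically presents as a timeline or waterfall.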

The above provides a brief explanation of the problem, and describes the concepts introduced to address the challenge. To dive deeper into implementation, refer to the documentation of the OpenTracing Project, or the documentation of Zipkin and Jaeger.

If you’re interested in learning more about the latest advances in serverless application monitoring, get in touch to find out how Lumigo is making distributed tracing for serverless extremely simple.
