Nov 02 2022
With serverless and containerized applications becoming the norm, workloads and integrations are spread across multiple cloud environments. As these apps become increasingly distributed, monitoring also becomes more complicated, with siloed and incomplete telemetry. This is where distributed tracing brings great value. It enables end-to-end visibility into your modern, complex application. Distributed tracing facilitates robust monitoring and tracking of processes at every stage so that you can identify performance bottlenecks and failures, diagnose the issues, and fix them.
For most organizations, the importance of distributed tracing is undisputed. But many get stuck in the dilemma of whether to acquire an observability tool for distributed systems or build a solution suiting their specific business needs.
In this blog post, we explore the factors that you must consider when deciding on whether to build or buy a distributed tracing solution. Before we head there, let’s quickly understand what distributed tracing is and its significance.
What is distributed tracing and why is it important?
Distributed tracing is the monitoring practice of tracing how requests move across the distributed environments of serverless and containerized applications. To deliver a single piece of application functionality, a customer request often passes through multiple services spread across different environments. Failures can occur deep in the architecture and spread “backwards” towards the end user (e.g., a failed database query resulting in an HTTP status code 500 being sent to the frontend). Latency at any service stage can delay execution, and that delay builds up across the layers of the system.
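To make the failure and latency propagation described above more concrete, here is a minimal sketch in Python. The `Span` record, its field names, the service names, and the timings are all hypothetical and are not taken from any particular tracing product. The sketch assumes each span records its parent, so that a trace can be walked from the failing frontend request down to the deepest failed span, and so that the time spent in each layer can be separated out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work in a trace (hypothetical, simplified record)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span (the user-facing request)
    service: str
    duration_ms: float        # wall-clock time, including time spent in children
    error: bool = False


def depth(span: Span, by_id: dict[str, Span]) -> int:
    """Number of ancestors between this span and the root."""
    d = 0
    while span.parent_id is not None:
        span = by_id[span.parent_id]
        d += 1
    return d


def deepest_failure(spans: list[Span]) -> Optional[Span]:
    """Return the failed span furthest from the user: a likely root cause."""
    by_id = {s.span_id: s for s in spans}
    failed = [s for s in spans if s.error]
    return max(failed, key=lambda s: depth(s, by_id), default=None)


def self_time_ms(span: Span, spans: list[Span]) -> float:
    """Time spent in the span itself, assuming its children ran one after another."""
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - children


# Illustrative trace: a failed database query surfaces as an HTTP 500 at the frontend.
trace = [
    Span("a", None, "frontend", 480.0, error=True),
    Span("b", "a", "orders-api", 450.0, error=True),
    Span("c", "b", "database", 400.0, error=True),
]

root_cause = deepest_failure(trace)
print(root_cause.service if root_cause else "no failure")
```

In this made-up trace all three spans are marked as failed, but only the database span is the origin. Likewise, most of the 480 ms seen by the user is inherited from the database call, not spent in the frontend itself.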
Distributed tracing has two approaches – agent-based and agentless:
Agent-based tracing: With this approach, you install an agent, in the form of code, on the system you want to monitor. The agent collects different metrics that you can use to evaluate the system’s performance. Depending on the number of agents and the resources they consume, agents can become a drain on system resources.
Agentless tracing: A unique identifier is assigned to each request flowing through the application’s different services, tying the work done in every service into a single, complete request flow. This approach has minimal impact on system performance, and is ideally suited for distributed microservice applications.
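The unique identifier described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any specific product implements it. It uses the `traceparent` header layout from the W3C Trace Context specification (version, trace ID, span ID, flags). The function names `start_trace` and `continue_trace`, and the three services, are hypothetical. Headers are modeled as a plain dictionary with a lowercase key, which is a simplification of real HTTP header handling.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 bytes -> 16 hex characters


def start_trace() -> dict[str, str]:
    """Called where a request first enters the system: mint a new trace ID."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    return {"traceparent": f"00-{trace_id}-{new_span_id()}-01"}


def continue_trace(incoming: dict[str, str]) -> dict[str, str]:
    """Called by each downstream service: keep the trace ID, mint a new span ID."""
    match = TRACEPARENT.match(incoming.get("traceparent", ""))
    if match is None:
        # Missing or malformed context: start a fresh trace instead of failing.
        return start_trace()
    trace_id, _parent_span_id, flags = match.groups()
    return {"traceparent": f"00-{trace_id}-{new_span_id()}-{flags}"}


# A request passing through three hypothetical services.
gateway_headers = start_trace()
orders_headers = continue_trace(gateway_headers)
payments_headers = continue_trace(orders_headers)

for name, headers in [("gateway", gateway_headers),
                      ("orders", orders_headers),
                      ("payments", payments_headers)]:
    print(name, headers["traceparent"])
```

Every service emits a different span ID but the same trace ID, which is what lets a tracing backend stitch the individual pieces of work back into one request flow.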
Characteristics of a good distributed tracing system
Although necessary, distributed tracing is a challenging observability practice to implement. Microservices demand a resilient distributed tracing system to effectively detect issues in serverless and containerized environments across a variety of technologies. A good distributed tracing system has the following characteristics:
Accuracy: Produces faithful tracing data and visualizations to give you an exact account of what is happening within your applications.
Reliability: Ensures that the data across different services are collected reliably and processed without loss of data and context.
Efficiency: Manages a great inflow of data without excessive performance overhead (e.g., in terms of memory consumption) or other issues.
Compliance: Implements strong compliance and data governance policies to avoid exposure of confidential data like account and user IDs.
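As an illustration of the compliance point, the following Python sketch scrubs sensitive values from span attributes before they leave the application. It is a minimal example under assumptions: the attribute names (`user.id`, `account.id`), the `redact` function, and the salt value are hypothetical, and replacing a value with a salted hash is only one possible policy. It is a form of pseudonymization and may not satisfy every governance requirement.

```python
import hashlib

# Hypothetical attribute names; adjust to whatever your spans actually carry.
SENSITIVE_KEYS = {"user.id", "account.id"}


def redact(attributes: dict[str, str], salt: str) -> dict[str, str]:
    """Return a copy of the attributes with sensitive values replaced by a salted hash."""
    cleaned = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
            # A short, stable token still lets you group spans by the same user.
            cleaned[key] = "redacted:" + digest[:12]
        else:
            cleaned[key] = value
    return cleaned


span_attributes = {
    "http.method": "POST",
    "user.id": "u-12345",
    "account.id": "acct-987",
}
safe = redact(span_attributes, salt="rotate-me")
print(safe)
```

Because the same input always maps to the same token, traces belonging to one user can still be correlated without the raw identifier being stored in the tracing backend.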
Now that we understand the key traits of a solid distributed tracing system, let’s discuss if you should build or buy one for your application.
Building vs Buying Distributed Tracing
Quick to start with, hard to master
A good distributed tracing system gives you fundamental benefits like reduced mean time to resolution (MTTR), a quick understanding of cause and effect between service issues, and the ability to pinpoint issues affecting specific user actions. However, building a solution that covers all aspects of tracing is deceptively complex, and a time- and resource-intensive endeavor. For small applications involving one or two teams, and with little technology diversity, it may be easy to get something up and running relatively quickly. But complexity piles up as the service and team count grows, and more and more effort must be devoted to instrumenting applications.
Buying, on the other hand, allows you to leverage the leading observability solutions available without investing the effort to build one in-house. The best solutions among these have put a lot of work into making it straightforward (ideally: turn-key) to introduce distributed tracing in an application, minimizing the effort to achieve coverage of the distributed-tracing system.
Source: lumigo.io
Licensing money vs. developer’s wage
The best organizations understand that a software license often costs less than the cumulative cost of maintaining an engineering team. Buying software might have an upfront expense, but it pays off in the long run and frees up engineering capacity to solve the unique challenges of a company. Designing an in-house distributed tracing system involves hefty investments to build, test, support, and upgrade that system, not to mention the costs of hiring, keeping, and growing developer talent. Distributed tracing is a well-understood problem in the observability space, but making a solid, reliable, production-ready observability system using open source is expensive, so many companies choose to license an existing solution.
Niche technology to monitor
If your product relies on technologies with limited adoption that are not supported by existing solutions, it’ll need a customized observability solution. But as interesting as this sounds, it is a very rare occurrence. Most software systems that organizations build have more in common with one another than their builders think.
It is also worth mentioning that the increasing relevance of OpenTelemetry as the de facto open-source distributed tracing instrumentation, together with its underpinning OpenTelemetry Protocol (OTLP), enables adopters to mix proprietary and custom distributed tracing with many existing observability solutions. So, you might need some custom instrumentation for the niche parts of your technology, and can reuse off-the-shelf solutions for the rest.
Open source software ecosystem
The OpenTelemetry project has gone to great lengths to lower the barrier to collecting telemetry, and especially distributed tracing data. Still, many find it hard to integrate the OpenTelemetry SDK into applications, and to troubleshoot that integration, without significant, specialized expertise in distributed tracing. And while some open source software (OSS) backends exist that can ingest and process OpenTelemetry data, they are not easy to operate reliably at scale, and are limited in the analysis and alerting capabilities they offer.
Additionally, if you’re planning to build a solution based on the OSS ecosystem, keep in mind that you will have to take and keep full ownership of customizing it for your organization. That includes handling security breaches that can occur down the line and adding support for more technologies you may adopt in your architectures.
Procurement challenges and company size
When you build an in-house system, the budget for it comes mostly out of the time and effort your engineers spend working on something other than your product. It is notoriously hard to measure how much effort and time it costs to run an internal system, both in terms of developing and operating the system itself and in terms of the cost to other teams resulting from gaps in features and usability. The resulting cost underestimation leads many engineering organizations to regard in-house systems as “cheaper”. And that is not the only reason: it is not uncommon to see engineering organizations go down the “build” route to avoid having to secure a monetary budget and work with procurement. Large companies also have much more “disposable engineering capacity” available, which can skew the decision towards build.
Indeed, the procurement of a solution can be a challenge. If you’re buying a tracing tool, you’ll have to devote effort to scouting, evaluating, budgeting for, and adopting the tool. However, the cost of maintaining and improving the procured system sits largely with the vendor. This considerably reduces the effort needed down the line, provided the vendor keeps evolving the solution to match emerging requirements.
Conclusion
Unique business needs, architectural complexity, and the expectation of needing customized monitoring for serverless applications are key motivators for organizations to build a monitoring solution. However, unless you are an industry behemoth that has abundant talent and uses niche technologies, building a distributed tracing tool in-house will put you in a tight spot.
Meanwhile, buying a feature-rich solution with comprehensive monitoring and visualization capabilities will enable you to quickly optimize your application performance. While the market is full of observability solutions, it is better to select an end-to-end observability and debugging solution like Lumigo that automates distributed tracing to help you manage the complexity of serverless and containerized environments.
In today’s competitive landscape, distributed tracing is becoming a necessity for organizations to deliver a seamless and consistent application experience. Whether you choose to build or buy a monitoring tool, it is important to decide how you will observe your cloud-native systems and to get started quickly to reap the benefits of better observability.