A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. The components interact in a decentralized manner and work together to achieve a common goal. Working with distributed systems is challenging because failures often spread between components, and debugging across multiple components is difficult and time-consuming. Distributed tracing is a way to automatically collect data about how the various components in a distributed system interact to serve specific requests, and it is an irreplaceable tool for troubleshooting distributed systems. In this blog, we will explore what distributed tracing is and how it can be used to monitor and debug a distributed system.
Distributed tracing is a technique that helps software engineers, SREs, and DevOps engineers understand the behavior of a distributed system. By tracking the flow of requests and transactions across different services, it provides a holistic view of the system and can help identify faults, bottlenecks, and performance issues.
To implement distributed tracing, engineers such as you and me need to write code or add libraries to our applications that generate and propagate trace context and record spans, which describe what each component is doing. As trace context is propagated across our applications and spans are collected, we can look at all the spans in a trace to visualize and analyze the flow of requests and transactions across different services.
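In practice you would normally reach for a library such as OpenTelemetry rather than hand-rolling this, but as a rough sketch of what "generating and propagating trace context" means, here is a minimal, illustrative take on the W3C `traceparent` header (the function names are made up for this example):

```python
import secrets

def new_traceparent():
    # W3C Trace Context "traceparent" header: version-traceid-spanid-flags.
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by every span in the trace
    span_id = secrets.token_hex(8)    # 16 hex chars, unique to this span
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(parent):
    # Propagation: keep the trace ID, mint a new span ID for the next hop.
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

outgoing = new_traceparent()              # header a service would send downstream
downstream = child_traceparent(outgoing)  # what the next service would record
assert downstream.split("-")[1] == outgoing.split("-")[1]  # same trace, new span
```

Because every hop keeps the same trace ID, a tracing backend can later stitch all the spans back into one end-to-end picture of the request.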
Simply having traces and logs in place is sometimes not enough to debug an intermittent failure. These failures can originate deep within a system that may be generating a lot of noise. As a developer, you need to identify the difference between a successful trace and a failed trace in order to pinpoint the issue. Some of the challenges faced when doing this are:
That being said, it is always good to follow some best practices from the get-go when debugging failures in a distributed system. These best practices will help you debug more quickly and mitigate some of the challenges mentioned above.
Let’s look at a classic distributed tracing scenario: Lambda → HTTP → ECS Service (Python Flask) → Database. The application throws intermittent failures when accessing the database, which in turn causes issues in the user-facing Lambda function, somewhat of a bubble-up effect.
There are a number of steps you can take when troubleshooting this example. In no particular order, here are a few:
That is a lot of manual steps to identify and isolate the cause of the failures, and this is where distributed tracing tools come in handy. They help you identify the root cause of the failure by embedding trace IDs in the logs and traces, giving you a holistic view of the system and letting you trace the flow of requests and transactions across different services at a granular level.
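As a taste of what "embedding trace IDs in the logs" can look like at the application level, here is a small sketch using Python's standard `logging` module; the logger name and the trace ID value are made up for the example:

```python
import logging

class TraceIdFilter(logging.Filter):
    """Attach the current request's trace ID to every log record."""
    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record):
        record.trace_id = self.trace_id
        return True  # keep the record; we only annotate it

logger = logging.getLogger("orders-service")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.addFilter(TraceIdFilter("4bf92f3577b34da6a3ce929d0e0e4736"))

logger.info("query failed, retrying")
# emits: INFO trace=4bf92f3577b34da6a3ce929d0e0e4736 query failed, retrying
```

With every log line carrying the trace ID, you can jump from a failed trace straight to the exact log lines each service wrote while handling that one request.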
This is where Lumigo enters the room. Lumigo’s distributed tracing capabilities give you the power to drill down into a request and see the exact path it took through your system, somewhat like a tracked GPS service. As a developer, you can set thresholds for error rates and latency, and if one of your services encounters an error, for example a spike in response times, Lumigo can send you an alert. It can also help you visualize bottlenecks in your infrastructure with a representation of the flow of requests within your entire system. With this information you get a bird’s-eye view of the infrastructure in place, helping you debug and identify issues from a single dashboard and set of tools.
Now let’s look at how you should approach analyzing the data in front of you. There are a number of ways you can do this, but here are a few tips:
Lumigo is a tool that helps you do all of the above with ease by providing a complete system map, a live tail of logs, a list of all transactions happening, ECS monitoring that gives you a complete picture of your CPU and memory utilization, and much more!
With a distributed system, you rely heavily on logs and traces to maintain good order and minimize downtime, but there are a few steps you can take to optimize your monitoring and alerting strategies.
To begin with, you should outline clear business objectives. You should have a clear understanding of your service level objectives (SLOs) and key performance indicators (KPIs). Having these clearly defined helps prevent disputes with customers and allows you to maintain a highly performant system. Metrics measured here can include latency, throughput, error rate, availability, and recovery time objective (RTO).
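As a quick illustration of how these metrics relate, here is a small sketch that computes an error-rate SLI and the fraction of error budget consumed; the request counts and the 99.9% availability target are assumed values for the example, not figures from this post:

```python
# Hypothetical request counts over a 30-day window.
total_requests = 1_200_000
failed_requests = 840

error_rate = failed_requests / total_requests  # SLI: error rate
availability = 1 - error_rate                  # SLI: availability
error_budget = 1 - 0.999                       # what a 99.9% SLO allows
budget_used = error_rate / error_budget        # fraction of the budget burned

print(f"availability={availability:.5f}, error budget used={budget_used:.0%}")
# availability=0.99930, error budget used=70%
```

Framing reliability as a budget like this makes the trade-off explicit: 70% of the budget burned mid-window is a signal to slow down risky changes before the SLO is breached.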
You then need to implement a monitoring tool such as Lumigo to help track and visualize the KPIs and SLOs. Within a tool such as Lumigo, you can set up proactive alert policies to ensure you are notified if anything out of the ordinary is happening. For example, you may have an alert that fires when latency exceeds a certain threshold. Alongside having the alerts in place, they also need to be actionable and provide enough context for an engineer to quickly identify the root cause of the issue and resolve it.
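The threshold check behind such an alert can be sketched in a few lines; this illustrative example flags a breach when p95 latency exceeds an assumed 500 ms threshold (the function names and sample data are made up):

```python
import statistics

def p95(latencies_ms):
    # 95th percentile; "inclusive" interpolates so small samples still work.
    return statistics.quantiles(latencies_ms, n=100, method="inclusive")[94]

def should_alert(latencies_ms, threshold_ms=500):
    return p95(latencies_ms) > threshold_ms

samples = [120, 130, 110, 140, 125, 135, 900]  # one slow outlier
print(p95(samples), should_alert(samples))     # 672.0 True
```

Alerting on a high percentile rather than the average is a deliberate choice: a single slow outlier barely moves the mean, but it is exactly what your unluckiest users experience.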
Finally, you should consistently refine and adjust your SLOs and KPIs to keep them up to date. As the landscape of your system changes, you may need to add new metrics or remove redundant ones, as well as testing what is currently in place to ensure your system is working as it should. By doing all of this, and becoming familiar with the system in hand, you reduce the likelihood of errors and minimize your overall downtime if something does go wrong. To find out more on SLOs, see our blog post on defining and measuring your SLIs and SLOs.
Before moving on, let’s see how simple it is to set up your own alerts using Lumigo.
Sign in to your Lumigo account and head to your dashboard. Then click “Alerts” in the side menu. Once you are on the Alerts page, click the “Create New Alert” button in the top right corner of the page.
Now that you are on the new alert page, first choose the alert type; this is where you will see firsthand Lumigo’s tight integration with AWS.
Next, give the alert a description and choose the service type you want to monitor.
Now give the alert a condition to work with. This is where you choose which resources to monitor, either by selecting the resource directly or by using tags.
Finally, once that has been selected, you just need to tell the alert how to notify you and how often. You have three options here: Email, PagerDuty, or Microsoft Teams.
Building a resilient system can be hard, but there are a few best practices you can follow to help you along the way. Here are some that you should follow:
By using these best practices you can help ensure your database and system integrations remain available, resilient, and secure.
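One of the most common of these practices, retrying transient database failures with exponential backoff, can be sketched as follows (the helper name and parameters are illustrative):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    # Retry a transient failure with exponential backoff: wait base_delay,
    # then 2x, then 4x, ... re-raising once the attempts are exhausted.
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

In a real service you would also cap the delay, add jitter so retries from many clients don't synchronize, and only retry operations that are safe to repeat: an `INSERT` that may already have committed is not.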
Overall, it is important to stay vigilant and know your system well when building and maintaining a distributed system. Observability is key, especially when things do not go to plan. In this post, we have looked at how to troubleshoot database access issues in Python Flask-based ECS services; how to analyze stack traces, transactions, invocations, and timelines; and how to implement effective monitoring and alerting strategies using Lumigo. We have also looked at how to prevent database access failures and some best practices to consider.
Happy observing, and stay tuned for the next blog post in this series on troubleshooting ECS and slow-draining queues.