Health checks are an important part of running containerized applications in the cloud, and many systems treat them as the source of truth for an application’s running status. In the context of Amazon Elastic Container Service (ECS), a health check is a periodic probe that assesses whether a container is functioning.
In this blog, we will explore how Lumigo, a troubleshooting platform built for microservices, can help provide insights into container crashes and failed health checks. With Lumigo, we can get deep insight into the behavior and performance of serverless and containerized applications, which gives the end user more transparency.
Let’s look at a scenario you may be faced with. We have a Python Flask application running in a container, but its health check is intermittent. Sometimes it returns HTTP 200 OK and other times it returns an HTTP error.
How can we understand intermittent health checks?
Health checks are designed to periodically assess the availability and responsiveness of a container. This is carried out through an endpoint within the application, usually `/health`. The health check endpoint is expected to return HTTP 200 OK if the application is healthy and an HTTP error if the application is unhealthy. In our case, it returns both at different times, which is what we mean by an intermittent health check.
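To make the idea concrete, here is a minimal sketch of a `/health` endpoint. It is written as a plain WSGI app using only the standard library so it stays self-contained; in a Flask application the same logic would live in a route handler. The `check_dependencies` helper, the dependency names, and the choice of 503 as the unhealthy status are assumptions for illustration, not taken from the application described here.

```python
import json


def check_dependencies():
    """Hypothetical checks; replace with real probes (database ping, etc.)."""
    return {"database": True, "cache": True}


def health_app(environ, start_response, checks=check_dependencies):
    """Minimal WSGI app that answers on /health and nowhere else."""
    if environ.get("PATH_INFO") != "/health":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]

    results = checks()
    healthy = all(results.values())
    # Assumption: 503 signals "unhealthy"; any non-2xx status would fail the probe.
    status = "200 OK" if healthy else "503 Service Unavailable"
    body = json.dumps({"healthy": healthy, "checks": results}).encode()
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Returning a small JSON body alongside the status code is optional, but it tends to make a failed probe easier to interpret later when looking at the request payload.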
Let’s assume that a failed health check triggers a restart of the container. This is generally a good thing, as it helps to keep the application running. However, if the health check fails intermittently, as it does in our scenario, it can lead to a large number of container restarts.
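One way to reason about how intermittent failures turn into restarts is to model the failure threshold. The sketch below shows an ECS-style container health check definition and a simplified model in which a container is marked unhealthy after a number of consecutive failed probes. The field names (`command`, `interval`, `timeout`, `retries`, `startPeriod`) are the ones used in ECS task definitions; the port, path, and values are assumptions, and `becomes_unhealthy` is a hypothetical helper that only approximates how the threshold is applied.

```python
# Shape of a container health check in an ECS task definition.
# The command, port, and numbers below are illustrative assumptions.
HEALTH_CHECK = {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
    "interval": 30,     # seconds between probes
    "timeout": 5,       # seconds before a probe counts as failed
    "retries": 3,       # failures tolerated before the container is unhealthy
    "startPeriod": 10,  # grace period after start, in seconds
}


def becomes_unhealthy(probe_results, retries):
    """Simplified model: True once `retries` consecutive probes have failed.

    `probe_results` is an iterable of booleans (True means the probe passed).
    """
    consecutive_failures = 0
    for passed in probe_results:
        consecutive_failures = 0 if passed else consecutive_failures + 1
        if consecutive_failures >= retries:
            return True
    return False
```

Under this model, a check that alternates between passing and failing never reaches a threshold of three, while a short run of back-to-back failures does. That is why the pattern of failures matters as much as the overall failure rate.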
The impact caused by bad health checks
Depending on the situation, the impact of bad health checks can be quite significant. Let’s take a look at a few examples:
Lumigo crash detection
This is where Lumigo comes into play. Lumigo is a powerful tool, somewhat of a cheat code, with a number of features that help you identify and resolve issues with intermittent health checks.
Let’s start with the Lumigo dashboard, which provides a comprehensive overview of running applications. In the dashboard, we can see a list of all the applications that are running along with their health status. A centralized dashboard like this is useful because it allows you to quickly spot any issues with your applications.
Next up, we have the Lumigo tracer. The tracer allows you to trace the execution of your application and provides a visual representation of the application flow. This is useful because it helps you identify any patterns or trends that may be causing the intermittent health checks.
To insert the tracer into your application, you can use the Lumigo Python SDK. The SDK is really easy to use and can be installed using pip:
pip install lumigo_tracer
For more detail, see the full documentation: https://pypi.org/project/lumigo-tracer/#description
Finally, we have the logs. This view shows the application’s log entries in real time, as they are generated, which helps you identify issues with the application without delay.
How to analyze the crashes
Once you have the Lumigo tracer installed in your application, you can start to analyze the behavior of the application. In our case, we can see that the health check is returning both HTTP 200 OK and an HTTP error. This is causing the container to crash and restart. Let’s take a look at an example of a failed request in Lumigo Explore.
In this example, we can see a number of health check errors on our application, each resulting in a 404 error.
This could be for a number of reasons, including a health check request arriving while the container is in an unhealthy state. We can open up the request to find out more information by clicking on the “Details” link. This will show a dialog box with the request payload.
If we click the “See Invocation” link in the top right, we can then see the full view of the request, including any log entries, the request output, and even a request flow map that shows exactly where the error lies.
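If you want to quantify the same pattern outside the UI, a small script over request records can help. The sketch below is not a Lumigo API; it assumes you already have a list of `(path, status_code)` pairs from your own access logs or an export, and the helper name `summarize_health_checks` is hypothetical.

```python
from collections import Counter


def summarize_health_checks(records, path="/health"):
    """Count status codes for health check requests and compute a failure rate.

    `records` is an iterable of (path, status_code) pairs.
    Any status of 400 or above is treated as a failed health check.
    """
    statuses = Counter(status for p, status in records if p == path)
    total = sum(statuses.values())
    failures = sum(count for status, count in statuses.items() if status >= 400)
    failure_rate = failures / total if total else 0.0
    return statuses, failure_rate
```

A breakdown by status code makes it easier to tell a routing problem (for example, a run of 404s) from an application that is genuinely failing its own checks.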
Strategies for improving health check reliability
Now that we have seen how easy it is to identify errors and break them down using Lumigo, let’s look at some ways to improve this reporting pipeline for reliability and notifications.
Don’t let bad health checks affect your ECS apps
Experiencing intermittent errors can be incredibly frustrating for developers, particularly when these errors result in container crashes. The health check endpoint serves as the ultimate authority for the application’s status, making its failure ripple throughout the system. When the health check fails, the consequences can be far-reaching and impactful, affecting the overall stability and performance of the application.
In this blog, we have explored the concept of intermittent health checks, delving into the challenges they pose and the potential implications that arise from them. We have discussed the importance of gaining a comprehensive understanding of running applications, as well as the significance of identifying any issues through detailed logs and visual representations of the request flow. Additionally, we have shared valuable tips aimed at enhancing the reliability of applications, ensuring smoother operations and improved performance.
If you are seeking a powerful solution to gain deep visibility into your microservices and containerized applications, sign up for Lumigo today. With Lumigo, you can unlock a comprehensive troubleshooting platform that helps you to resolve issues related to container crashes and failed health checks, ensuring transparency and improved performance for your applications.