Jun 27 2023
Health checks for cloud infrastructure refer to the mechanisms and processes used to monitor the health and availability of the components within a cloud-based system. These checks are essential for ensuring that the infrastructure is functioning correctly and that any issues or failures are detected and addressed promptly. Health checks typically involve monitoring various parameters such as system resources, network connectivity, and application-specific metrics.
With distributed software systems running on cloud-based infrastructure, assessing and reporting the health of a system becomes more complex. The complexity arises from the need to define appropriate health check configurations for each component that makes up the final app or web service. This includes specifying the endpoints, ports, and protocols to be monitored, as well as defining the criteria for success or failure of the application as a whole. Additionally, managing health checks at scale in a dynamic environment like Kubernetes can be challenging, especially because it involves containerized workloads and a potentially large number of pods, services, and replicas.
Additional factors that complicate health checks in containerized applications are:
- When using 12-factor apps, we typically make use of a dynamic service discovery mechanism, where services can be created, scaled, and removed automatically. Health checks need to adapt to the changing network topology and discover the appropriate endpoints to monitor.
- Twelve-factor apps are self-contained and export their services via port binding, often on a port dynamically assigned by the execution environment. Health checks that target a specific port therefore need environment variables (or an equivalent configuration) to keep the health check config aligned with the dynamic port assigned to the application at runtime.
- Applications often have dependencies on other services or resources. In order to assess the health of an application, one must take into account all the dependencies and ensure that the entire stack is healthy before marking a component as ready.
- Typically applications are run as one or more stateless processes. Each process type typically has a specific purpose, such as a web server, worker, or background task. The implication – health checks should be appropriately configured for each process type. For example, a web server process type might have an HTTP endpoint for health checks, while a worker process type might require a different mechanism, such as a message queue or database connection check.
- The lifecycle of a container includes starting, stopping, and restarting. If health checks do not account for these dynamic states, they may report false negatives while containers are still initializing. A delay or initial startup period before the health checks begin is often necessary to avoid false failures.
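In Kubernetes, this startup grace period can be expressed directly in the pod spec. The sketch below (the container name, image, and port are illustrative) uses a startupProbe so that liveness checking only begins once the application has finished initializing:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-starting-app          # illustrative name
spec:
  containers:
  - name: app
    image: slow-starting-app:latest
    ports:
    - containerPort: 8080
    startupProbe:
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 30         # allow up to 30 * 5s = 150s for startup
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 10
```

Until the startupProbe succeeds, Kubernetes suppresses the liveness probe, so a slow-starting container is not killed while it is still initializing.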
Basic Health Check Design
There are also general design considerations for health checks. A health check should be independent of the application logic, so that it can be tested and maintained in isolation; this is harder to achieve when the application logic is complex. It should also be resilient to failures in the application or the underlying infrastructure, which is difficult when the application is distributed or depends on third-party services. Finally, exposing a clear, consistent, language-agnostic health status is important for a successful implementation. One way to achieve this is to standardize the health endpoint's responses with an OpenAPI specification. A code snippet follows:
openapi: 3.0.0
info:
  title: My API
  version: 1.0.0
paths:
  /health:
    get:
      summary: Get the health of the API
      operationId: getHealth
      responses:
        "200":
          description: The API is healthy
          content:
            application/json:
              schema:
                type: object
                properties:
                  status:
                    type: string
                    example: "Healthy"
        "503":
          description: The API is unhealthy
          content:
            application/json:
              schema:
                type: object
                properties:
                  status:
                    type: string
                    example: "Unhealthy"
Here’s a sample Python/Flask app that implements the /health endpoint described by the OpenAPI specification above.
from flask import Flask, jsonify

app = Flask(__name__)

def application_is_healthy():
    # Place application-specific checks here (database, dependencies, etc.)
    return True

@app.route("/health")
def health():
    # Return the standardized status payloads from the OpenAPI specification
    if application_is_healthy():
        return jsonify(status="Healthy"), 200
    return jsonify(status="Unhealthy"), 503

if __name__ == "__main__":
    app.run(debug=True)
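Such an endpoint can be sanity-checked without starting a server by driving it in-process with Flask's test client. The minimal app below stands in for the one above; it is a quick verification harness, not a substitute for a real probe:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # Minimal handler matching the OpenAPI contract for the healthy case
    return jsonify(status="Healthy"), 200

# Exercise the endpoint in-process with Flask's built-in test client
with app.test_client() as client:
    response = client.get("/health")
    print(response.status_code)           # 200
    print(response.get_json()["status"])  # Healthy
```

The same pattern works in a CI pipeline to verify that the health endpoint honors the contract before the container image is built.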
To overcome these challenges, it is crucial to understand the Kubernetes health check mechanisms, design appropriate checks for different components, and leverage monitoring and automation tools to simplify the management and analysis of health check data. The starting point is the Kubernetes-native way of enabling health checks. Many of the pitfalls identified earlier are prevented by the built-in design Kubernetes provides as the container orchestrator. Kubernetes offers built-in health check functionality through two primary mechanisms: liveness probes and readiness probes.
> A liveness probe determines whether a container within a pod is running properly. It verifies that the application is responsive and functioning as expected. If the liveness probe fails, Kubernetes automatically restarts the container.
Here is how to define a livenessProbe for a pod:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app-container
    image: my-app-image
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
> A readiness probe determines whether a container within a pod is ready to receive network traffic. It ensures that the application is fully initialized and capable of serving requests. If the readiness probe fails, the container is marked as not ready, and Kubernetes stops sending traffic to it.
Here is how to define a readinessProbe for a pod:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app-container
    image: my-app-image
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 3
For a readinessProbe, the failureThreshold field sets the number of consecutive failures allowed before considering the container as not ready. In this case, if the probe fails three times in a row, the container will be marked as not ready.
Implementing OpenTelemetry for Enhanced Observability
The use of OpenTelemetry (OTel for short) can standardize the instrumentation of your applications. This can make it easier to collect, analyze, and correlate telemetry data across services. This standardized approach to observability helps improve troubleshooting, performance optimization, and understanding of the interactions within distributed web applications. OTel is implemented as an open-source project that consists of a set of APIs, libraries, agents, and instrumentation. OpenTelemetry is designed to be vendor-neutral and portable, so that it can be used with any observability or monitoring tool. It is also designed to be extensible, so that new telemetry data sources and analysis tools can be added easily.
Here is a basic code stub that demonstrates how to set up OTel using the Go language libraries.
package main

import (
	"log"
	"net/http"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	"go.opentelemetry.io/otel/propagation"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// function call stub standing in for real request handling
func doSomethingUsefulHere() {}

func main() {
	// Init steps that set up the OTel exporter and tracer provider
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatal(err)
	}
	provider := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	otel.SetTracerProvider(provider)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		tracer := otel.Tracer("example")
		ctx := r.Context()
		ctx, span := tracer.Start(ctx, "handleRequest")
		defer span.End()
		// function call stub
		doSomethingUsefulHere()
		// Set the identifier on the span
		span.SetAttributes(attribute.String("custom_id_attribute", "example_identifier"))
		// Return response
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("Hello, World!"))
	})

	// Start the server
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	log.Printf("listening on port %s", port)
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
OpenTelemetry is designed to work with all commonly used languages and frameworks. Here’s a code stub that shows how to make use of OTel libraries in JavaScript:
const { BasicTracerProvider, ConsoleSpanExporter, SimpleSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { Resource } = require("@opentelemetry/resources");
const { SemanticResourceAttributes } = require("@opentelemetry/semantic-conventions");
const provider = new BasicTracerProvider({ resource: new Resource({ [SemanticResourceAttributes.SERVICE_NAME]: "my-service" }) });
provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
provider.register();
Using Enhanced Observability Tools
An alternative option is to make use of an observability tool written by a vendor to create these health checks. For instance, our troubleshooting platform is designed for microservice-based applications running on Kubernetes, AWS Elastic Container Service (ECS), and AWS Lambda.
Our mission is to automate observability as much as possible, allowing development teams to devote more time to building and debugging. With this aim, we’ve designed Lumigo to integrate seamlessly with Kubernetes. The deployment involves just a single command and does not require any changes to your existing codebase. Our Kubernetes Operator is designed to automatically detect and apply updates to applications within a Kubernetes namespace. This feature simplifies tracing requests across multiple services, helps in pinpointing performance bottlenecks, and enables swift error identification.
With Lumigo, you can gain visibility into your microservice architecture, understand the flow of requests, and optimize performance, ultimately ensuring the reliability and efficiency of your applications.
package main

import (
	"log"
	"net/http"
	"os"

	"github.com/lumigo-io/lumigo-go-tracer"
)

func main() {
	// Init steps for setting up the Lumigo handshake
	token := os.Getenv("LUMIGO_TOKEN")
	if token == "" {
		log.Fatal("LUMIGO_TOKEN environment variable not set")
	}
	lumigo.Init(lumigo.Config{Token: token})

	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		// Perform any health check logic
		isHealthy := true
		// basic application health check logic
		if isHealthy {
			w.WriteHeader(http.StatusOK)
			w.Write([]byte("OK"))
		} else {
			w.WriteHeader(http.StatusServiceUnavailable)
			w.Write([]byte("Service Unavailable"))
		}
	})

	// run the service
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	log.Printf("Server listening on port %s", port)
	log.Fatal(http.ListenAndServe(":"+port, lumigo.MiddlewareHandler(http.DefaultServeMux)))
}
Inefficiency Remains Unresolved
Observability tools generate vast amounts of data, including metrics, logs, traces, and more. Without proper filtering, aggregation, and visualization mechanisms, this abundance of data can become overwhelming, making it challenging to extract actionable insights. With Kubernetes this is further exacerbated because it generates data from a variety of sources, including pod logs and cluster events.
This sets the stage for inefficient instrumentation in several ways. A common side-effect is increased latency from excessive resource consumption, which degrades overall system performance. Another symptom is a high volume of noise and false positives, leading to alert fatigue in which important signals get lost.
With vendor tools that meter ingestion and processing, observability running costs can become inordinately high. Both ends of the spectrum are impractical: too much instrumentation and too little are each detrimental to the team's needs. DevOps teams commonly struggle to keep their observability costs in check.
At Lumigo, we prioritize the developer experience over everything else. It is our aim to build observability tools with the needs of the customer at the core of our design. To help create a more robust health check capability, we are introducing the ability to filter out HTTP calls that specifically target health checks on ECS clusters. Lumigo is now configured to drop spans that carry ELB-HealthChecker/* in the header and return a 200 OK status. Processing is then limited to spans known to belong to failed health checks.
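The filtering rule described above can be illustrated with a small sketch. The span fields and helper below are hypothetical, not Lumigo's actual implementation: a span is dropped only when it carries the ELB health-checker user agent and completed with a 200 OK.

```python
def should_drop_span(span: dict) -> bool:
    """Hypothetical predicate mirroring the filtering rule: drop spans
    produced by the ELB health checker that completed successfully."""
    user_agent = span.get("http.user_agent", "")
    status = span.get("http.status_code")
    return user_agent.startswith("ELB-HealthChecker/") and status == 200

# Successful health-check span: dropped
healthy = {"http.user_agent": "ELB-HealthChecker/2.0", "http.status_code": 200}
# Failed health-check span: kept for troubleshooting
failing = {"http.user_agent": "ELB-HealthChecker/2.0", "http.status_code": 503}
# Regular request span: kept
regular = {"http.user_agent": "curl/8.0", "http.status_code": 200}

print(should_drop_span(healthy), should_drop_span(failing), should_drop_span(regular))
# True False False
```

Filtering on both conditions keeps failed health checks visible while eliminating the steady stream of successful probe traffic from the ingestion pipeline.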
Set up a free Lumigo account today and give this feature (and many more great ones) a try.