
Kubernetes Design Patterns For Optimal Observability


Technology is a fast-moving commodity. Trends, ideas, techniques, and tools evolve rapidly in the software technology space. This rapid change is felt particularly strongly in the software that engineers in the cloud-native space use to build, deploy, and operate their applications. One area where we have seen especially rapid evolution over the past few years is observability.

Observability for cloud-based software refers to the ability to gain insight into the internal workings and performance of software running in the cloud. Recent changes in how applications are deployed have made web applications distributed and dynamic, and this in turn has transformed the way these systems are monitored.

What have these changes to software development been? Web applications have undergone a significant evolution from monoliths to microservices and, eventually, to containerization. Traditionally, web applications were built using a monolithic architecture, with the entire application deployed as a single unit. To address the limitations of monoliths, the industry started embracing microservices architecture, which breaks an application into small, independently deployable services. Containerization is a technique that allows applications and their dependencies to be packaged as lightweight, self-contained units called containers. Containers provide isolation, encapsulation, and portability, making it easier to deploy and run applications consistently across different environments.

Ergo, the role of observability, and the processes around it, have changed dramatically over this period. The basic premise of observability is collecting and analyzing various types of telemetry data and metrics to understand how the system behaves. This allows engineering teams to make informed decisions about optimization and troubleshooting. A lack of proper observability hurts engineering teams in several ways: it leaves them with software systems that have zero transparency, it limits their ability to do proper capacity planning and troubleshooting, and it results in systems that are inefficient and impossible to optimize because there is no data about resource utilization.

This post is written as a starting point for software engineers who work with Kubernetes and want to improve their understanding of their systems. That goal, we believe, can be realized by improving the observability of their software in production. Kubernetes is a complex system, but the right combination of tools and techniques can give engineers the degree of transparency they need to work well with the applications running on it.

Observability In A Kubernetes World

Kubernetes observability has to be thought of at three levels. 

The first is the application level. This is somewhat independent of the choice of Kubernetes itself, but is influenced by it. The second is Kubernetes itself: the way the API server, its controllers, and its other components work. The third is the infrastructure that powers Kubernetes: from raw compute upwards, there are many ways in which the underlying infrastructure influences the performance of applications.

By understanding how these three pieces interact with each other and form a complete stack, we can design for optimal observability. 

Patterns For Application-level Observability

Application monitoring and observability have a long history of innovation. Designing for observability builds on many popular architectural patterns in software development, especially those used for systems that require responsiveness, scalability, and loose coupling between components.

Monitoring Event-driven Applications

The first application-level pattern we will consider is monitoring event-driven applications.

Event-driven applications are a popular architectural pattern in software development, especially for building systems that require responsiveness, scalability, and loose coupling between components. In event-driven programming, the flow of the application is determined by events that occur asynchronously, such as user actions, messages from other components, or changes in system state.

Monitoring event-driven applications involves tracking events, their processing, and the overall health of the system. 

The specific monitoring requirements for event-driven applications may vary based on the complexity, scale, and specific technology stack used. 

Here is an example of a Node.js app that demonstrates basic eventing functionality:

const http = require('http');
const events = require('events');

const hostname = '127.0.0.1';
const port = 8081;

const sampleEmitter = new events.EventEmitter();

// Log the payload whenever a 'ping' event is emitted
sampleEmitter.on('ping', function (data) {
  console.log(data);
});

sampleEmitter.emit('ping', 'Node.js Example showcasing events and the emitters');

// 'once' registers a listener that fires only on the first emission
let triggered = 0;
sampleEmitter.once('event', () => {
  console.log(++triggered);
});
sampleEmitter.emit('event');

// Always register an 'error' listener; an unhandled 'error' event crashes the process
sampleEmitter.on('error', (err) => {
  console.error('something went horribly wrong! ' + err);
});
sampleEmitter.emit('error', new Error('error'));

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  res.end('log message confirms app running');
});

server.listen(port, hostname, () => {
  console.log(`Sample app listening at http://${hostname}:${port}/`);
});

Any observability or monitoring tool can be configured to listen for these events. Some tools also capture their context, providing useful information to whoever is debugging or troubleshooting the app.

Ingesting Application Metrics Data

Continuing with the thread of a sample Node.js app, the following example shows how data about the app, in the form of default metrics, can be ingested as a data source. It exposes the metrics the app collects so that an external system, such as Prometheus, can scrape them:

const http = require('http')
const url = require('url')
const client = require('prom-client')

const registrySample = new client.Registry()

registrySample.setDefaultLabels({
  app: 'metrics-ingestion-sample'
})

// Collect Node.js default metrics (event loop lag, heap usage, etc.) into the registry
client.collectDefaultMetrics({ register: registrySample })

const server = http.createServer(async (req, res) => {
  const route = url.parse(req.url).pathname
  if (route === '/metrics') {
    res.setHeader('Content-Type', registrySample.contentType)
    res.end(await registrySample.metrics())
  }
})

server.listen(8081)
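
With the /metrics endpoint in place, a Prometheus server can be pointed at the app. The snippet below is a minimal sketch of such a scrape job, assuming the app is reachable at localhost:8081 as in the example above; inside a Kubernetes cluster you would typically use kubernetes_sd_configs for service discovery rather than a static target:

scrape_configs:
  - job_name: metrics-ingestion-sample
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8081']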

Monitoring Kubernetes Components

In this section, we will demonstrate how some native Kubernetes components can be monitored. The purpose is to illustrate the native capabilities of Kubernetes that support observability. It also shows that a key part of the observability stack derives from the abstraction and orchestration layers configured on top of the underlying infrastructure.

Monitoring kube-apiserver

The Kubernetes API server is the core of the control plane, and therefore a key component of any cluster. It typically runs as a container within the kube-system namespace.

The kube-apiserver container is readily instrumented to help monitor the Kubernetes API server: it provides a metrics endpoint that can be scraped without additional exporters. Here’s an example of a Prometheus job which can help bootstrap kube-apiserver monitoring:

$ kubectl get cm prom-server -n monitoring -o yaml > prom-server.yaml
$ vi prom-server.yaml
scrape_configs:
    - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      job_name: kube-api-server
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

There are several metrics worth monitoring on the kube-apiserver. One of the highest-cardinality metrics is apiserver_request_duration_seconds_bucket. Measuring it provides a way to ascertain the latency of the requests being served by the cluster.

Another useful metric is apiserver_request_total, which can help determine traffic volume, error rates, and the like. It can be broken down by HTTP request type and status code to characterize the traffic, which leads to discovering error rates, traffic types, and so on.
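
As a minimal sketch of how these two metrics can be put to work, the Prometheus rule file below records a 99th-percentile request latency and raises an alert when the server-side error rate climbs. The rule names, label filters, and the 5% threshold are illustrative assumptions rather than a standard configuration:

groups:
  - name: kube-apiserver.rules
    rules:
      # 99th percentile request latency, broken down by verb
      - record: apiserver:request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))
      # Fire when more than 5% of requests return a 5xx status code
      - alert: KubeAPIServerErrorRateHigh
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            / sum(rate(apiserver_request_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: More than 5% of Kubernetes API server requests are failing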

Monitoring kube-controller-manager 

The kube-controller-manager is the component in a Kubernetes cluster that continuously watches the state of the cluster and works to match it with the desired state. It is designed as a daemon that houses the control loops central to the Kubernetes system.

Here is an example of how you could set up monitoring of kube-controller-manager using a Prometheus ServiceMonitor defined in a YAML file:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-controller-manager-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      component: kube-controller-manager
  endpoints:
    - port: metrics
      interval: 30s

The following configuration also needs to be added to the Prometheus configuration file (prometheus.yaml) in use:

scrape_configs:
  - job_name: kube-controller-manager-scraper
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - monitoring
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        action: keep
        regex: kube-controller-manager

Many useful metrics are exposed by the kube-controller-manager by default. Among them, workqueue_work_duration_seconds is a key metric that gives an instant picture of the load on each controller. Related metrics such as workqueue_unfinished_work_seconds and workqueue_longest_running_processor_seconds can help identify the processes that are blocking others, making it easier to pick out rogue processes or outliers slowing down a controller.
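
The sketch below shows one way these workqueue metrics could feed an alert. The job label matches the scrape job defined above, while the alert name and the 60-second threshold are illustrative assumptions:

groups:
  - name: kube-controller-manager.rules
    rules:
      # Fire when a single workqueue item has been processing for more than a minute
      - alert: ControllerWorkqueueStalled
        expr: |
          max by (name) (workqueue_longest_running_processor_seconds{job="kube-controller-manager-scraper"}) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: A controller workqueue item has been running for over a minute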

Monitoring kube-scheduler

The kube-scheduler is a third example of a native Kubernetes component with built-in instrumentation that can be used to gauge the health of a cluster. Like the API server and the controller manager, it exposes a Prometheus-format metrics endpoint. The example below shows a pod that runs a Prometheus sidecar next to the application container; the sidecar's scrape targets, which can include the scheduler's metrics endpoint, come from the referenced ConfigMap.

apiVersion: v1
kind: Pod
metadata:
  name: test-app-with-sidecar-pod
spec:
  containers:
    - name: testapp-container
      image: testapp-image:latest
      # Your main application container configuration

    - name: prometheus-sidecar
      image: prom/prometheus:v2.30.0
      args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--web.listen-address=:9090"
      ports:
        - containerPort: 9090
      volumeMounts:
        - name: config-volume
          mountPath: /etc/prometheus/prometheus.yml
          subPath: prometheus.yml

  volumes:
    - name: config-volume
      configMap:
        name: prometheus-configmap
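
The sidecar reads its scrape configuration from the prometheus-configmap referenced above. Below is a minimal sketch of what that ConfigMap might contain in order to scrape the scheduler. The component=kube-scheduler pod label, the HTTPS scheme, and the service account token path are assumptions that depend on the distribution and on the RBAC permissions granted to the pod's service account:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-configmap
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: kube-scheduler
        scheme: https
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods labelled as the scheduler (kubeadm-style clusters label them this way)
          - source_labels: [__meta_kubernetes_pod_label_component]
            action: keep
            regex: kube-scheduler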

The True Value Of Observability

Observability impacts many areas across an organization. The investment in observability toolchains returns great value through the positive outcomes it enables for stakeholders both inside and outside the teams that build and run the software.

Generally, observability, the ability to understand and analyze the behavior, performance, and health of an application or its infrastructure, allows engineering teams to get more out of their technology investments by improving the performance of their software. More specifically, it reduces the mean time to resolution for issues, helps optimize resource utilization, and drives further cost savings through better capacity planning. If you’re looking for an observability solution for your Kubernetes-based applications, look no further than Lumigo.

Kubernetes Troubleshooting with Lumigo

Lumigo is a troubleshooting platform, purpose-built for microservice-based applications. Developers using Kubernetes to orchestrate their containerized applications can use Lumigo to monitor, trace, and troubleshoot issues fast. Deployed with zero code changes and automated in one click, Lumigo stitches together every interaction between micro and managed services into end-to-end stack traces. These traces, served alongside request payload data, give developers complete visibility into their container environments. Using Lumigo, developers get:

  • End-to-end virtual stack traces across every micro and managed service that makes up a containerized application, in context
  • API visibility that makes all the data passed between services available and accessible, making it possible to perform root cause analysis without digging through logs 
  • Distributed tracing that is deployed with no code and automated in one click 
  • Unified platform to explore and query across microservices, see a real-time view of applications, and optimize performance

To try Lumigo for Kubernetes, check out our Kubernetes operator on GitHub.
