Kubernetes OOMKilled Error: How to Fix and Tips for Preventing It


What Is Kubernetes OOMKilled (Exit Code 137)? 

Kubernetes OOMKilled (Exit Code 137) indicates that a container was terminated by the Linux kernel because of an Out Of Memory (OOM) condition. The exit code is 128 plus 9, where 9 is the number of the SIGKILL signal the kernel's OOM killer sends to the process. This event usually means that a container in a pod exceeded its memory limit, or that the node itself ran so low on memory that it could not allocate more. When a container is terminated due to an OOM condition, Kubernetes marks it as OOMKilled and records exit code 137 for troubleshooting.

Understanding how Kubernetes deals with system resources, particularly memory, is vital to managing and preventing OOMKilled events. Kubernetes uses cgroups (control groups), a Linux kernel feature, to limit the resource usage of processes. When a container in Kubernetes is created, it is assigned to a specific cgroup. The cgroup has a defined amount of memory that the container can use. If a container tries to consume more memory than its cgroup allows, the Linux kernel triggers an OOM condition, leading to the OOMKilled event.
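As a quick illustration, you can see the cgroup limit that corresponds to a container's memory limit from inside the container itself. The pod and container names below are placeholders, and the file path depends on whether the node uses cgroup v2 or cgroup v1:

```shell
# Inspect the effective cgroup memory limit from inside a running container.
# "my-pod" and "my-container" are placeholder names.
kubectl exec my-pod -c my-container -- cat /sys/fs/cgroup/memory.max                      # cgroup v2
kubectl exec my-pod -c my-container -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes    # cgroup v1
```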

This is part of a series of articles about Kubernetes troubleshooting

How Does the Linux OOM Killer Mechanism Work?

The Out of Memory (OOM) Killer is a mechanism the Linux kernel invokes when the system is critically low on memory. Since the kernel cannot create additional physical memory, it must reclaim memory from running processes. The OOM Killer's role is to select processes to terminate in order to free memory.

The OOM Killer uses a scoring algorithm to decide which process to target. Each process is assigned a score, exposed as oom_score, based primarily on how much memory it is using; the score can be raised or lowered through the process's oom_score_adj setting. The process with the highest oom_score is selected for termination.
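On a Linux node you can inspect these scores directly under /proc (the PID below is a placeholder). Kubernetes itself adjusts oom_score_adj according to a pod's QoS class, which is why Guaranteed pods are less likely to be chosen than BestEffort pods:

```shell
# Per-process OOM scores; replace <pid> with a real process ID.
cat /proc/<pid>/oom_score        # the score the OOM Killer compares
cat /proc/<pid>/oom_score_adj    # adjustment value (-1000 to 1000), set by users or the kubelet

# After an OOM kill, the kernel log records which process was chosen:
dmesg | grep -i "killed process"
```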

It’s important to note that the OOM Killer is a necessary component of the Linux kernel that helps ensure the stability of the system when faced with memory pressure. By understanding how the OOM Killer mechanism functions, we can better design and configure our Kubernetes applications to avoid OOMKilled events.

Common Causes of OOMKilled 

OOMKilled events can be triggered by a variety of factors. Here are some of the most common causes:

Misconfigured Memory Limits

One of the most common causes of OOMKilled events is misconfigured memory limits. When deploying a container in Kubernetes, it's essential to set appropriate memory limits. If a container's limit is set lower than the memory it actually needs to function correctly, the container will eventually exceed that limit and be OOMKilled.

To avoid this, it’s important to understand the memory requirements of your application. Monitor the memory usage of your application under different load scenarios to get a clear picture of its memory needs. Then, set the memory limits accordingly in your Kubernetes deployment configuration.
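For example, a container spec might declare its memory needs like this. The names and values are illustrative; derive the real numbers from the usage you observe under load:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app                        # placeholder name
spec:
  containers:
    - name: web
      image: example.com/web-app:1.0   # placeholder image
      resources:
        requests:
          memory: "256Mi"   # what the scheduler reserves for the container
        limits:
          memory: "512Mi"   # hard cap; exceeding it triggers an OOM kill
```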

Memory Leaks in Applications

Another common cause of OOMKilled events is memory leaks in applications. A memory leak occurs when a program consumes memory but does not release it back to the system after it’s done using it. Over time, this can lead to an increase in the memory usage of the application, eventually triggering an OOMKilled event.

Identifying and fixing memory leaks can be a challenging task. It requires a deep understanding of the programming language and the application’s codebase. However, it’s a crucial part of preventing OOMKilled events.

Node Memory Pressure

Node memory pressure is another factor that can lead to OOMKilled events. When a node in a Kubernetes cluster is under memory pressure, it means that the node’s available memory is low. This can happen if too many pods are scheduled on a single node, or if the pods running on the node consume more memory than anticipated.

To mitigate node memory pressure, it’s important to monitor the memory usage of your nodes regularly. If a node is consistently under memory pressure, consider adding more nodes to your cluster or rescheduling some pods to other nodes.
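A few kubectl commands are useful for spotting memory pressure early. The node name below is a placeholder, and kubectl top requires the metrics server to be installed:

```shell
# Per-node memory usage (requires the metrics server):
kubectl top nodes

# Check whether a specific node reports the MemoryPressure condition:
kubectl describe node <node-name> | grep -A5 Conditions

# MemoryPressure status for every node in the cluster:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'
```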

Unbounded Resource Consumption

Unbounded resource consumption is another common cause of OOMKilled events. It occurs when an application or process consumes system resources, including memory, without any bound, either because of a bug or because the application is designed to consume resources aggressively.

To prevent unbounded resource consumption, it’s important to design your applications with resource limits in mind. Implement mechanisms in your application to limit resource consumption, such as limiting the number of concurrent connections or requests.
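At the cluster level, a LimitRange is a useful guardrail: it applies default requests and limits to containers that do not declare their own, so nothing in the namespace runs completely unbounded. A minimal sketch with hypothetical names and values:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-memory-bounds   # placeholder name
  namespace: my-namespace       # placeholder namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: "128Mi"   # applied when a container omits a memory request
      default:
        memory: "256Mi"   # applied when a container omits a memory limit
      max:
        memory: "1Gi"     # upper bound any container in the namespace may set
```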

Diagnosing and Debugging OOMKilled Issues in Kubernetes 

Inspecting Logs and Events

The first step in diagnosing Kubernetes OOMKilled (Exit Code 137) is inspecting logs and events. Logs are the breadcrumbs that applications leave behind, offering a wealth of information about what was happening at the time of the issue. Kubernetes provides various logs, such as pod logs, event logs, and system logs, each serving a specific purpose.

Pod logs are the output of the containers running in a pod. They can provide insights into error messages generated by your application or the runtime. Event logs, on the other hand, show significant state changes in a pod’s lifecycle, such as scheduling, pulling images, and killing containers. Finally, system logs refer to logs from Kubernetes system components like the kubelet or API server.

To effectively inspect logs and events, it is essential to familiarize yourself with kubectl, Kubernetes’ command-line tool. With the right kubectl commands, you can retrieve logs, describe pods, or get events, providing a clearer picture of what might have caused the OOMKilled status.
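The following commands are a typical starting point; the pod and container names are placeholders. kubectl describe pod is particularly useful here, since it shows the container's last termination state, including Reason: OOMKilled and Exit Code: 137:

```shell
# Logs from the previous (killed) container instance:
kubectl logs my-pod -c my-container --previous

# Pod details, including the last termination state of each container:
kubectl describe pod my-pod

# Recent events for the pod, newest last:
kubectl get events --field-selector involvedObject.name=my-pod --sort-by='.lastTimestamp'
```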

Examining Resource Quotas and Limits

The next step in diagnosing Kubernetes OOMKilled (Exit Code 137) is examining resource quotas and limits. Kubernetes allows us to set resource quotas at the namespace level and resource limits at the container level. These settings help to ensure fair allocation of resources among pods and prevent any single pod from hogging resources.

When a container exceeds its memory limit, the kernel kills it and Kubernetes reports the OOMKilled status. You can inspect the configured requests and limits with kubectl describe pod, and compare them against actual consumption reported by kubectl top pod (which requires the metrics server). If you find that your pods are consistently reaching or exceeding their resource limits, it might be time to reassess your resource allocation.
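For example (the pod name is a placeholder, and kubectl top requires the metrics server):

```shell
# Configured memory requests and limits for each container in the pod:
kubectl get pod my-pod -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\t"}{.resources.limits.memory}{"\n"}{end}'

# Actual current usage, per container, to compare against the limits above:
kubectl top pod my-pod --containers
```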

Related content: Read our guide to kubectl restart pod

Analyzing Application Code

If the logs, events, and resource usage data don’t provide a clear picture, it might be time to look at the application code. The code could be consuming more memory than expected due to a bug, a memory leak, or inefficient use of data structures.

Analyzing application code can be a complex task, especially when dealing with large codebases or unfamiliar programming languages. However, various tools can help, such as profiling tools, memory analyzers, or even simple log statements to track memory usage. Remember, the goal is to identify sections of code that consume excessive memory, so focus your efforts on suspicious areas or places where large data structures are handled.

Best Practices to Prevent OOMKilled Status 

Properly Setting Memory Requests and Limits

The first step to prevent Kubernetes OOMKilled (Exit Code 137) is to properly set memory requests and limits. Memory requests tell the Kubernetes scheduler how much memory to reserve for a pod, while memory limits define the maximum amount of memory a pod can use.

Setting these values appropriately is a balancing act. If requests are too low, your pods might not have enough memory to function correctly, leading to OOMKilled status. Set them too high, and you risk wasting resources and reducing the overall efficiency of your cluster. As a best practice, monitor your application’s memory usage over time and adjust the requests and limits accordingly.

Monitoring and Alerting

Another crucial practice to prevent OOMKilled status is implementing robust monitoring and alerting. Monitoring allows you to keep track of your cluster’s health and performance, while alerting notifies you of potential issues before they escalate into major problems.

There are several monitoring tools available for Kubernetes, such as Prometheus and Grafana, which can provide detailed insight into your cluster’s performance. These tools can monitor metrics like CPU usage, memory usage, network bandwidth, and more.

Moreover, setting up alerting rules can help you detect when a pod’s memory usage is approaching its limit, allowing you to take preventive action. Alerts can be set up through email, Slack, or any other communication platform your team uses.
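As an illustration, a Prometheus alerting rule can fire when a container's working set approaches its memory limit. The sketch below assumes the Prometheus Operator plus cAdvisor and kube-state-metrics; exact metric and label names can vary between versions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-near-limit        # placeholder name
spec:
  groups:
    - name: memory.rules
      rules:
        - alert: ContainerMemoryNearLimit
          expr: |
            max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
              / on (namespace, pod, container)
            max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
              > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container memory usage is above 90% of its limit"
```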

Implementing Resource Quotas

Resource quotas are another powerful tool to prevent Kubernetes OOMKilled (Exit Code 137). By setting resource quotas, you can limit the amount of CPU and memory resources that each namespace can consume, ensuring fair allocation of resources and preventing any single namespace from overloading the system.

Setting resource quotas requires careful planning. You need to consider your application’s requirements, the capacity of your cluster, and the number of namespaces. Once set, you can use kubectl describe namespace to monitor the usage of resources against the quotas.
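A ResourceQuota manifest might look like the following; the names and values are hypothetical and should be tuned to your cluster's capacity:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota        # placeholder name
  namespace: team-a         # placeholder namespace
spec:
  hard:
    requests.memory: "4Gi"   # total memory all pods in the namespace may request
    limits.memory: "8Gi"     # total of all memory limits across the namespace
    requests.cpu: "2"
    limits.cpu: "4"
```

Keep in mind that once a quota covers memory or CPU, every pod created in the namespace must specify those resources, either explicitly or through LimitRange defaults, or it will be rejected.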

Code Optimization and Testing

Finally, optimizing your application code and conducting thorough testing can help prevent OOMKilled status. Code optimization involves improving your code to make it more efficient and reduce its memory footprint. This could mean refactoring complex functions, optimizing data structures, or eliminating memory leaks.

Testing, on the other hand, involves running your application under different scenarios to identify potential issues. This could include stress testing, where you push your application to its limits to see how it performs, or endurance testing, where you run your application over a prolonged period to detect memory leaks or other long-term issues.

Kubernetes Troubleshooting with Lumigo

Lumigo is a troubleshooting platform, purpose-built for microservice-based applications. Developers using Kubernetes to orchestrate their containerized applications can use Lumigo to monitor, trace and troubleshoot issues in Python, Java and Node.js apps automatically within a Kubernetes namespace. Deployed with zero code changes and automated in one click, Lumigo stitches together every interaction between micro and managed services into end-to-end stack traces. These traces, served alongside request payload data, give developers complete visibility into their container environments. Using Lumigo, developers get:

  • End-to-end virtual stack traces across every micro and managed service that makes up the application, in context
  • API visibility that makes all the data passed between services available and accessible, making it possible to perform root cause analysis without digging through logs 
  • Distributed tracing that is deployed with no code and automated in one click 
  • Unified platform to explore and query across microservices, see a real-time view of applications, and optimize performance

To try Lumigo for Kubernetes, check out our Kubernetes operator on GitHub.
