Kubernetes troubleshooting involves identifying, investigating, and resolving issues within a Kubernetes cluster. This can involve investigating problems with containerized applications, the control plane, or the underlying infrastructure.
While Kubernetes troubleshooting is complex, by employing best practices, log analysis, and automated tools, administrators can effectively pinpoint and address issues to ensure a reliable, high-performing Kubernetes environment.
This is part of an extensive series of guides about DevOps.
Kubernetes troubleshooting is complicated due to several factors:
Production environments in Kubernetes can be highly complex, with numerous interconnected components such as containers, nodes, and services working together to form the overall system. This intricate architecture makes it challenging to pinpoint the exact root cause of issues and requires deep expertise to effectively diagnose and resolve problems.
In addition, Kubernetes clusters often host multiple microservices built by different teams. Each team may follow unique development methodologies, use various programming languages, and have distinct dependencies. This diversity makes it difficult to maintain consistency across the cluster, leading to potential conflicts and problems that can be hard to troubleshoot.
To tackle these challenges, close coordination among development, operations, and security teams is essential. Establishing clear lines of communication and collaboration helps ensure that issues are identified and resolved more efficiently. Additionally, using appropriate tools, such as monitoring and observability platforms, can aid in detecting anomalies and providing insights into the cluster’s health.
Here are some of the main errors you might encounter in Kubernetes, and quick advice for how to resolve them.
The CrashLoopBackOff status indicates that a pod's container repeatedly crashes shortly after starting, and Kubernetes is backing off before restarting it again. This error commonly occurs when the node lacks the resources needed to run the pod, when required volumes fail to mount, or when the application itself exits with an error.
Identifying the error
You can identify this error by using the kubectl get pods command. The affected pod should show a CrashLoopBackOff status. You can view details about the pod to further investigate the error by using the kubectl describe pod command.
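For example (the pod name my-app-5bbd8 and the default namespace are placeholders used for illustration):

kubectl get pods
# Affected pods show a STATUS of CrashLoopBackOff and a climbing RESTARTS count
kubectl describe pod my-app-5bbd8
# The Events section at the end of the output usually explains why the container keeps restarting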
Resolving the error
The main causes of the CrashLoopBackOff error include insufficient resources on the node, failed volume mounts, and errors in the application itself. Reviewing the pod's events and the logs of the most recently crashed container usually reveals which of these applies, as shown below.
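A quick way to see the application's own error output is to fetch the logs of the previous (crashed) container instance; the pod name below is a placeholder:

kubectl logs my-app-5bbd8 --previous
# If the pod has multiple containers, add -c <container-name>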
This status indicates that a pod could not start because Kubernetes was unable to pull the container image (hence ImagePull). The BackOff element in the message indicates that Kubernetes will keep trying to pull the image, waiting longer between each attempt.
Identifying the error
To troubleshoot the ImagePullBackOff error, run the kubectl describe pod command to collect information about the pod and save the output to a text file such as /tmp/troubleshooting_describe_pod.txt.
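For example (the pod name and namespace are placeholders):

kubectl describe pod my-app-5bbd8 -n example-namespace > /tmp/troubleshooting_describe_pod.txt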
Next, check the Events section in the describe command’s output. Look for one of these messages in the text file:
Repository ... does not exist or no pull access
Manifest ... not found
authorization failed
Resolving the error
If the error message is Repository … does not exist or no pull access, the Docker registry used by the cluster does not contain the specified repository. Kubernetes pulls images from Docker Hub by default, but the cluster could be configured to use a private registry. Sometimes this error occurs simply because the repository name specified by the pod is incorrect.
To resolve this, check the pod's specification and ensure that the image and repository are correctly defined. If the error remains, the cause could be a network issue blocking access to the registry. In this case, check the registry hostname in the pod description (saved in the text file) and try to pull the image manually from one of the nodes to verify connectivity.
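For example, you might verify the image reference in the pod spec and then test the pull by hand; the registry, repository, and tag names below are placeholders:

kubectl get pod my-app-5bbd8 -o jsonpath='{.spec.containers[*].image}'
# From a node (or any machine with registry access), try pulling the same reference:
docker pull registry.example.com/my-team/my-app:1.4.2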
If the error is Manifest ... not found, Kubernetes couldn't find the specified version of the container image. The tag referenced by the pod might be incorrect, so check that it is spelled correctly and actually exists in the repository.
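One way to confirm whether a given tag exists (the image name and tag below are placeholders) is to inspect its manifest directly:

docker manifest inspect registry.example.com/my-team/my-app:1.4.2
# A "manifest unknown" or "not found" response means the tag does not exist in the registry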
If the message is authorization failed, the credentials provided to the container registry did not authorize the image pull. You can resolve this by creating a Secret with the correct credentials and referencing it in the pod's specification. Make sure the Secret's credentials have the necessary permissions to access the image or repository.
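A minimal sketch of this fix, using placeholder registry details, is to create a docker-registry Secret and make it available to the pod:

kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=my-user \
  --docker-password=my-password \
  --docker-email=my-user@example.com
# Either reference the secret in the pod spec under spec.imagePullSecrets,
# or attach it to the service account the pod uses:
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "regcred"}]}'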
This error indicates that the container was terminated due to an application error or an invalid reference, such as an image spec that points to a file which does not exist in the container.
Resolving the error
To address an Exit Code 1 error, check the container logs to determine whether one of the files specified in the image is missing. A missing file indicates an invalid reference; modify the image specification to point to the correct path and file name. If you cannot find an invalid reference, check for application errors and debug the affected library.
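For example, you could confirm the exit code recorded for the last terminated container and then inspect its logs (the pod name is a placeholder):

kubectl get pod my-app-5bbd8 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
kubectl logs my-app-5bbd8 --previous | tail -n 50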
The Exit Code 125 error occurs when a container fails to run because the command invoked by Kubernetes in the system shell did not execute properly; for instance, a docker run command is issued but fails. This error is often caused by a typo or undefined flag in the command, insufficient permissions for the user running it, or an incompatibility with the container engine.
Resolving the error
To address this error, run the same container outside Kubernetes to confirm that the command executes correctly, check the command and its flags for typos or unsupported options, and verify that the user invoking it has sufficient permissions.
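A simple way to reproduce the failure outside the cluster (the image name below is a placeholder) is to run the image directly and check the exit status:

docker run --rm registry.example.com/my-team/my-app:1.4.2
echo $?
# An exit status of 125 means the docker run command itself failed, before the container's own command ran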
A node displays a NotReady status when it shuts down or crashes, making the stateful pods running on it unavailable. If this condition persists for longer than five minutes, the status of the scheduled pods becomes Unknown. Kubernetes then attempts to reschedule the pods on a different node, assigning them a status of ContainerCreating.
Identifying the error
You can identify the affected pods by using the kubectl get pods command. To check whether the scheduled pods have been moved to other nodes, run kubectl get pods -o wide and see whether the same pod now appears on a different node.
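For example (node and pod names are placeholders):

kubectl get nodes
# The failed node shows a STATUS of NotReady
kubectl get pods -o wide
# The NODE column shows where each pod is scheduled, so you can see whether a pod has moved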
Resolving the error
Sometimes this issue resolves itself when the failed node recovers or after you reboot it. Once the node recovers and rejoins the cluster, the pod with the Unknown status is deleted and Kubernetes detaches the failed node's volumes. When the pod is rescheduled on another node, Kubernetes attaches the necessary volumes and the pod's status becomes ContainerCreating. After a default five-minute waiting period, the status should switch to Running.
However, if the failed node does not recover in time, you may need to guide Kubernetes to reschedule the stateful pods onto a new node. You can do this by removing the failed node from the cluster with the kubectl delete node command. Alternatively, delete the stateful pods with an Unknown status by using the following command:
kubectl delete pods pod-name --grace-period=0 --force -n example-namespace
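Similarly, to remove the failed node itself (the node name node-2 is a placeholder), you could run:

kubectl delete node node-2
# Afterwards, confirm that the pods have been rescheduled onto healthy nodes:
kubectl get pods -o wide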
Lumigo is a troubleshooting platform purpose-built for microservice-based applications. Developers using Kubernetes to orchestrate their containerized applications can use Lumigo to monitor, trace, and troubleshoot issues fast. Deployed with zero code changes and automated in one click, Lumigo stitches together every interaction between microservices and managed services into end-to-end stack traces. These traces, served alongside request payload data, give developers complete visibility into their container environments.
To try Lumigo for Kubernetes, check out our Kubernetes operator on GitHub.
Together with our content partners, we have authored in-depth guides on several other topics that can also be useful as you explore the world of DevOps.