Kubernetes Troubleshooting in the Cloud


May 4, 2025 - 15:35

Kubernetes has been used by organizations for nearly a decade, covering everything from wrapping applications inside containers and pushing them to a container registry, to running full production deployments.

At some point, we need to troubleshoot various issues in Kubernetes environments.

In this blog post, I will review some of the common ways to troubleshoot Kubernetes, focusing on the managed Kubernetes services of the hyperscale cloud providers.

Common Kubernetes issues

Before we deep dive into Kubernetes troubleshooting, let us review some of the common Kubernetes errors:

  • CrashLoopBackOff - A container in a pod keeps failing shortly after starting, so Kubernetes restarts it over and over with an exponentially increasing back-off delay. This usually points to an application crash, a missing dependency (such as a ConfigMap, Secret, or reachable backend), or a misconfigured command or probe.
  • ImagePullBackOff - Kubernetes can’t pull the container image for a pod. Common causes are a wrong image name or tag, missing or invalid registry credentials, or network issues between the node and the registry.
  • CreateContainerConfigError - Kubernetes can’t create the container because its configuration is invalid, for example an environment variable that references a missing ConfigMap or Secret, an incorrect volume mount, or a security context the node cannot satisfy.
  • PodInitializing - A pod is stuck starting up, usually because its init containers are failing or taking too long, or because networking or attached storage is not ready.
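To see which of these errors are currently present in a cluster, you can filter the STATUS column of kubectl get pods. The sketch below runs against captured sample output (the pod names are hypothetical) so the filtering logic is visible on its own; against a live cluster you would pipe kubectl get pods directly into the same awk filter.

```shell
# Sample 'kubectl get pods' output (hypothetical pod names); in a live
# cluster, replace the variable with the output of: kubectl get pods
sample_output='NAME        READY   STATUS                       RESTARTS   AGE
web-7d4b9   1/1     Running                      0          3d
api-5f6c2   0/1     CrashLoopBackOff             12         3d
job-9k1m3   0/1     ImagePullBackOff             0          1h
cfg-2p8q7   0/1     CreateContainerConfigError   0          5m'

# Keep only pods whose STATUS column signals one of the errors above.
failing=$(printf '%s\n' "$sample_output" |
  awk 'NR > 1 && $3 ~ /BackOff|Error|Initializing/ {print $1 ": " $3}')
printf '%s\n' "$failing"
```

The same filter works unchanged with kubectl get pods --all-namespaces (the status then sits in a different column, so adjust the awk field number accordingly).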

Kubectl for Kubernetes troubleshooting

Kubectl is the native and recommended command-line tool for managing Kubernetes, and it is also the first tool to reach for when troubleshooting the various aspects of a cluster.

Below are some examples of using kubectl:

  • View all pods and their statuses:

    kubectl get pods

  • Get detailed information and recent events for a specific pod:

    kubectl describe pod <pod-name>

  • View logs from a specific container in a multi-container pod:

    kubectl logs <pod-name> -c <container-name>

  • Open an interactive shell inside a running pod:

    kubectl exec -it <pod-name> -- /bin/bash

  • Check the status of cluster nodes:

    kubectl get nodes

  • Get detailed information about a specific node:

    kubectl describe node <node-name>

Additional information about kubectl can be found at:

https://kubernetes.io/docs/reference/kubectl
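When a pod misbehaves, the Events section at the bottom of kubectl describe pod output usually names the root cause. The sketch below filters Warning events out of captured sample event lines (pod and image names are hypothetical); against a live cluster you would apply the same filter to the real describe output.

```shell
# Sample event lines as shown by 'kubectl describe pod' (hypothetical values).
events='Normal   Scheduled  2m   default-scheduler  Successfully assigned default/api-5f6c2 to node-1
Normal   Pulling    2m   kubelet            Pulling image "registry.example.com/api:v2"
Warning  Failed     2m   kubelet            Failed to pull image: manifest unknown
Warning  BackOff    1m   kubelet            Back-off pulling image'

# Warning events are the ones that usually explain the failure.
warnings=$(printf '%s\n' "$events" | awk '$1 == "Warning"')
printf '%s\n' "$warnings"
```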

Remote connectivity to Kubernetes nodes

In rare cases, you may need to connect remotely to a Kubernetes node as part of troubleshooting, for example to investigate hardware failures, collect system-level logs, clean up disk space, or restart services.

Below are secure ways to remotely connect to Kubernetes nodes:

  • To connect to an Amazon EKS node using AWS Systems Manager Session Manager from the command line, use the following command:

    aws ssm start-session --target <instance-id>

    For more details, see:

    https://docs.aws.amazon.com/eks/latest/best-practices/protecting-the-infrastructure.html

  • To connect to an Azure AKS node using Azure Bastion from the command line, run the commands below to get the private IP address of the AKS node and SSH to it from a Bastion-connected environment:

    az aks machine list --resource-group <resource-group> --cluster-name <cluster-name> --nodepool-name <nodepool-name> -o table

    ssh -i /path/to/private_key.pem azureuser@<node-private-ip>

    For more details, see:

    https://learn.microsoft.com/en-us/azure/aks/node-access

  • To connect to a GKE node using the gcloud command combined with Identity-Aware Proxy (IAP), use the following command:

    gcloud compute ssh <node-name> --zone <zone> --tunnel-through-iap

    For more details, see:

    https://cloud.google.com/compute/docs/connect/ssh-using-iap#gcloud

Monitoring and observability

To assist in troubleshooting, Kubernetes produces various logs; some are enabled by default (such as container logs) and some need to be explicitly enabled (such as Control Plane logs). Below are some of the ways to collect logs in managed Kubernetes services.

Amazon EKS

  • To collect EKS node and application logs (including resource utilization), use CloudWatch Container Insights, as explained below:

    https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/container-insights-detailed-metrics.html

  • To collect EKS Control Plane logs to CloudWatch Logs, follow the instructions below:

    https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html

  • To collect metrics from the EKS cluster, use Amazon Managed Service for Prometheus, as explained below:

    https://docs.aws.amazon.com/eks/latest/userguide/prometheus.html

Azure AKS

  • To collect AKS node and application logs (including resource utilization), use Azure Monitor Container Insights, as explained below:

    https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-data-collection-configure

  • To collect AKS Control Plane logs to Azure Monitor, configure a diagnostic setting, as explained below:

    https://docs.azure.cn/en-us/aks/monitor-aks?tabs=cilium#aks-control-planeresource-logs

  • To collect metrics from the AKS cluster, use Azure Monitor managed service for Prometheus, as explained below:

    https://learn.microsoft.com/en-us/azure/
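A practical note for the EKS case: AWS Systems Manager identifies a node by its EC2 instance ID, not by its Kubernetes node name. The instance ID can be extracted from the node's .spec.providerID field; below is a minimal sketch using a hypothetical providerID value (on a live cluster you would read the field with kubectl get node -o jsonpath).

```shell
# On EKS, a node's .spec.providerID looks like aws:///<zone>/<instance-id>.
# Hypothetical value; in a live cluster you would read it with:
#   provider_id=$(kubectl get node <node-name> -o jsonpath='{.spec.providerID}')
provider_id='aws:///us-east-1a/i-0123456789abcdef0'

# The EC2 instance ID is the last path segment of the providerID.
instance_id=${provider_id##*/}
printf '%s\n' "$instance_id"

# The result is the value that 'aws ssm start-session --target' expects.
```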