Kubernetes Troubleshooting in the Cloud


May 4, 2025 - 15:35

Kubernetes has been used by organizations for nearly a decade, covering everything from wrapping applications inside containers and pushing them to a container registry, to running full production deployments.

At some point, we need to troubleshoot various issues in Kubernetes environments.

In this blog post, I will review some of the common ways to troubleshoot Kubernetes, focusing on the managed Kubernetes services of the hyperscale cloud providers.

Common Kubernetes issues

Before we deep dive into Kubernetes troubleshooting, let us review some of the common Kubernetes errors:

  • CrashLoopBackOff - A container in a pod keeps failing shortly after starting, so Kubernetes restarts it over and over with an exponentially increasing back-off delay. This usually points to an application crash, a missing dependency (such as a ConfigMap, Secret, or reachable backend), or a misconfigured command or probe.
  • ImagePullBackOff - Kubernetes can’t pull the container image for a pod. Common causes are a wrong image name or tag, missing or invalid registry credentials, or network issues between the node and the registry.
  • CreateContainerConfigError - Kubernetes can’t create the container because its configuration is invalid, for example an environment variable that references a missing ConfigMap or Secret, an incorrect volume mount, or a security context the node cannot satisfy.
  • PodInitializing - A pod is stuck starting up, usually because its init containers are failing or taking too long, or because networking or attached storage is not ready.
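To see which of these errors are currently present in a cluster, you can filter the STATUS column of kubectl get pods. The sketch below runs against captured sample output (the pod names are hypothetical) so the filtering logic is visible on its own; against a live cluster you would pipe kubectl get pods directly into the same awk filter.

```shell
# Sample 'kubectl get pods' output (hypothetical pod names); in a live
# cluster, replace the variable with the output of: kubectl get pods
sample_output='NAME        READY   STATUS                       RESTARTS   AGE
web-7d4b9   1/1     Running                      0          3d
api-5f6c2   0/1     CrashLoopBackOff             12         3d
job-9k1m3   0/1     ImagePullBackOff             0          1h
cfg-2p8q7   0/1     CreateContainerConfigError   0          5m'

# Keep only pods whose STATUS column signals one of the errors above.
failing=$(printf '%s\n' "$sample_output" |
  awk 'NR > 1 && $3 ~ /BackOff|Error|Initializing/ {print $1 ": " $3}')
printf '%s\n' "$failing"
```

The same filter works unchanged with kubectl get pods --all-namespaces (the status then sits in a different column, so adjust the awk field number accordingly).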

Kubectl for Kubernetes troubleshooting

Kubectl is the native and recommended command-line tool for managing Kubernetes, and it is also the first tool to reach for when troubleshooting the various aspects of a cluster.

Below are some examples of using kubectl:

  • View all pods and their statuses:

    kubectl get pods

  • Get detailed information and recent events for a specific pod:

    kubectl describe pod <pod-name>

  • View logs from a specific container in a multi-container pod:

    kubectl logs <pod-name> -c <container-name>

  • Open an interactive shell inside a running pod:

    kubectl exec -it <pod-name> -- /bin/bash

  • Check the status of cluster nodes:

    kubectl get nodes

  • Get detailed information about a specific node:

    kubectl describe node <node-name>

Additional information about kubectl can be found at:

https://kubernetes.io/docs/reference/kubectl
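When a pod misbehaves, the Events section at the bottom of kubectl describe pod output usually names the root cause. The sketch below filters Warning events out of captured sample event lines (pod and image names are hypothetical); against a live cluster you would apply the same filter to the real describe output.

```shell
# Sample event lines as shown by 'kubectl describe pod' (hypothetical values).
events='Normal   Scheduled  2m   default-scheduler  Successfully assigned default/api-5f6c2 to node-1
Normal   Pulling    2m   kubelet            Pulling image "registry.example.com/api:v2"
Warning  Failed     2m   kubelet            Failed to pull image: manifest unknown
Warning  BackOff    1m   kubelet            Back-off pulling image'

# Warning events are the ones that usually explain the failure.
warnings=$(printf '%s\n' "$events" | awk '$1 == "Warning"')
printf '%s\n' "$warnings"
```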

Remote connectivity to Kubernetes nodes

In rare cases, you may need to connect remotely to a Kubernetes node as part of troubleshooting, for example to investigate hardware failures, collect system-level logs, clean up disk space, or restart services.

Below are secure ways to remotely connect to Kubernetes nodes:

  • To connect to an Amazon EKS node using AWS Systems Manager Session Manager from the command line, use the following command:

    aws ssm start-session --target <instance-id>

    For more details, see:

    https://docs.aws.amazon.com/eks/latest/best-practices/protecting-the-infrastructure.html

  • To connect to an Azure AKS node using Azure Bastion from the command line, run the commands below to get the private IP address of the AKS node and SSH to it from a Bastion-connected environment:

    az aks machine list --resource-group <resource-group> --cluster-name <cluster-name> --nodepool-name <nodepool-name> -o table

    ssh -i /path/to/private_key.pem azureuser@<node-private-ip>

    For more details, see:

    https://learn.microsoft.com/en-us/azure/aks/node-access

  • To connect to a GKE node using the gcloud command combined with Identity-Aware Proxy (IAP), use the following command:

    gcloud compute ssh <node-name> --zone <zone> --tunnel-through-iap

    For more details, see:

    https://cloud.google.com/compute/docs/connect/ssh-using-iap#gcloud

Monitoring and observability

To assist in troubleshooting, Kubernetes produces various logs; some are enabled by default (such as container logs) and some need to be explicitly enabled (such as Control Plane logs). Below are some of the ways to collect logs in managed Kubernetes services.

Amazon EKS

  • To collect EKS node and application logs (including resource utilization), use CloudWatch Container Insights, as explained below:

    https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/container-insights-detailed-metrics.html

  • To collect EKS Control Plane logs to CloudWatch Logs, follow the instructions below:

    https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html

  • To collect metrics from the EKS cluster, use Amazon Managed Service for Prometheus, as explained below:

    https://docs.aws.amazon.com/eks/latest/userguide/prometheus.html

Azure AKS

  • To collect AKS node and application logs (including resource utilization), use Azure Monitor Container Insights, as explained below:

    https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-data-collection-configure

  • To collect AKS Control Plane logs to Azure Monitor, configure a diagnostic setting, as explained below:

    https://docs.azure.cn/en-us/aks/monitor-aks?tabs=cilium#aks-control-planeresource-logs

  • To collect metrics from the AKS cluster, use Azure Monitor managed service for Prometheus, as explained below:

    https://learn.microsoft.com/en-us/azure/
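A practical note for the EKS case: AWS Systems Manager identifies a node by its EC2 instance ID, not by its Kubernetes node name. The instance ID can be extracted from the node's .spec.providerID field; below is a minimal sketch using a hypothetical providerID value (on a live cluster you would read the field with kubectl get node -o jsonpath).

```shell
# On EKS, a node's .spec.providerID looks like aws:///<zone>/<instance-id>.
# Hypothetical value; in a live cluster you would read it with:
#   provider_id=$(kubectl get node <node-name> -o jsonpath='{.spec.providerID}')
provider_id='aws:///us-east-1a/i-0123456789abcdef0'

# The EC2 instance ID is the last path segment of the providerID.
instance_id=${provider_id##*/}
printf '%s\n' "$instance_id"

# The result is the value that 'aws ssm start-session --target' expects.
```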