Check Cluster Health

Regularly checking your Kubernetes cluster health helps you catch issues before they impact your applications. This guide shows you how to quickly assess the overall health of your cluster.

Quick Health Check

1. Check Node Status

# List all nodes and their status
kubectl get nodes

# Detailed node information
kubectl get nodes -o wide

# Describe a specific node
kubectl describe node <node-name>

What to Look For:

Status should be Ready
All nodes should be in Ready state
Check age - recently created nodes might still be initializing

Example Output:

NAME           STATUS   ROLES           AGE   VERSION
control-plane  Ready    control-plane   30d   v1.28.0
worker-1       Ready    <none>          30d   v1.28.0
worker-2       Ready    <none>          30d   v1.28.0

2. Check Node Conditions

Node conditions provide detailed health information:

# View node conditions
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}{"="}{.status}{"\t"}{end}{"\n"}{end}'

Key Conditions:

Ready: Node is healthy and ready to accept pods (True = good)
MemoryPressure: Node has insufficient memory (False = good)
DiskPressure: Node has insufficient disk space (False = good)
PIDPressure: Node has insufficient process IDs (False = good)
NetworkUnavailable: Node network is not configured (False = good)

Using describe for detailed conditions:

kubectl describe node <node-name>

Look for the Conditions section:

Conditions:
  Type                 Status  Reason                       Message
  ----                 ------  ------                       -------
  NetworkUnavailable   False   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  PIDPressure          False   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    KubeletReady                  kubelet is posting ready status

3. Check Control Plane Components

# Check component statuses (legacy command)
kubectl get componentstatuses

# Check control plane pods
kubectl get pods -n kube-system

# Check API server health
kubectl get --raw /healthz

# Check API server readiness
kubectl get --raw /readyz

Key Components:

kube-apiserver: API server pods should be running
kube-controller-manager: Controller manager should be running
kube-scheduler: Scheduler should be running
etcd: etcd pods should be running (if self-hosted)

Example:

kubectl get pods -n kube-system | grep -E "kube-apiserver|kube-controller|kube-scheduler|etcd"

4. Check kubelet Health

# Check kubelet status on a node
# SSH to node or use kubectl exec if you have access
systemctl status kubelet

# Check kubelet logs (on the node)
journalctl -u kubelet -n 50

# Check kubelet service (from cluster)
# View node status for kubelet conditions
kubectl describe node <node-name> | grep -A 5 "KubeletReady"

What to Check:

kubelet service should be active and running
No frequent restarts
Logs should show successful node heartbeat

5. Check Pod Distribution

# Check pods per node
kubectl get pods -o wide --all-namespaces | awk '{print $7}' | sort | uniq -c

# See pods that can't be scheduled
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Check for CrashLoopBackOff pods
kubectl get pods --all-namespaces | grep CrashLoopBackOff

6. Check Resource Usage

# Node resource usage (requires metrics-server)
kubectl top nodes

# Pod resource usage
kubectl top pods --all-namespaces

# Node capacity vs allocatable
kubectl describe node <node-name> | grep -A 5 "Capacity\|Allocatable"

Comprehensive Health Check Script

Create a simple script to check everything:

#!/bin/bash

echo "=== Node Status ==="
kubectl get nodes

echo -e "\n=== Node Conditions ==="
for node in $(kubectl get nodes -o name); do
  echo "$node:"
  kubectl get $node -o jsonpath='{range .status.conditions[*]}{.type}={.status} {end}' && echo
done

echo -e "\n=== Control Plane Pods ==="
kubectl get pods -n kube-system | grep -E "kube-apiserver|kube-controller|kube-scheduler|etcd"

echo -e "\n=== API Server Health ==="
kubectl get --raw /healthz && echo " OK" || echo " FAILED"

echo -e "\n=== Pending Pods ==="
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

echo -e "\n=== CrashLoopBackOff Pods ==="
kubectl get pods --all-namespaces | grep CrashLoopBackOff || echo "None"

echo -e "\n=== Node Resource Usage ==="
kubectl top nodes 2>/dev/null || echo "metrics-server not available"

Common Issues and Solutions

Node NotReady

Symptoms:

NAME      STATUS     ROLES    AGE   VERSION
node-1    NotReady   <none>   5d    v1.28.0

Check:

kubectl describe node node-1
# Look for:
# - Network issues
# - kubelet not running
# - Resource pressure

Solutions:

Check kubelet service: systemctl status kubelet
Check network connectivity
Check resource pressure (memory, disk)
Review kubelet logs

Memory/Disk Pressure

Symptoms:

Node condition shows MemoryPressure=True or DiskPressure=True
Pods can't be scheduled
Existing pods might be evicted

Check:

kubectl describe node <node-name> | grep -A 2 "Pressure"

Solutions:

Free up disk space
Add more memory or nodes
Clean up unused images: docker system prune (if using Docker)
Remove unused volumes

Control Plane Issues

Symptoms:

API server not responding
kubectl commands fail
Components showing as unavailable

Check:

kubectl get pods -n kube-system
kubectl logs -n kube-system kube-apiserver-<node>

Solutions:

Restart control plane components
Check etcd health (if self-hosted)
Verify network connectivity between components
Check logs for errors

High Resource Usage

Symptoms:

Node showing high CPU/memory usage
Pods being throttled
Slow application performance

Check:

kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu

Solutions:

Identify resource-heavy pods
Add resource limits to pods
Scale cluster horizontally
Optimize application resource usage

Regular Health Check Schedule

Daily:

Quick node status check
Pending/CrashLoopBackOff pods

Weekly:

Full cluster health check
Resource usage review
Control plane component status

Monthly:

Capacity planning review
Node condition deep dive
Historical trend analysis

Monitoring Tools

While manual checks are important, consider setting up:

Prometheus + Grafana: Comprehensive metrics and alerting
kubectl-who-can: Audit permissions
kube-score: Static analysis of YAML files
Popeye: Kubernetes cluster sanitizer

Key Takeaways

Check nodes first - They're the foundation of your cluster
Monitor conditions - They provide early warning signs
Watch control plane - If it's unhealthy, nothing works
Track resource usage - Prevent capacity issues
Automate checks - Regular automated health checks catch issues early

A healthy cluster is the foundation for reliable applications. Regular health checks help you catch and fix issues before they impact your users!