Check Cluster Health
Regularly checking your Kubernetes cluster health helps you catch issues before they impact your applications. This guide shows you how to quickly assess the overall health of your cluster.
Quick Health Check
1. Check Node Status
# List all nodes and their status
kubectl get nodes
# Detailed node information
kubectl get nodes -o wide
# Describe a specific node
kubectl describe node <node-name>
What to Look For:
- Status should be Ready - all nodes should be in the Ready state
- Check age - recently created nodes might still be initializing
Example Output:
NAME STATUS ROLES AGE VERSION
control-plane Ready control-plane 30d v1.28.0
worker-1 Ready <none> 30d v1.28.0
worker-2 Ready <none> 30d v1.28.0
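If the node list is long, you can filter for nodes whose Ready condition is not True (a quick sketch; quoting may need adjusting for your shell):
# Print each node with its Ready status, then keep only the unhealthy ones
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | grep -v "True$"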
2. Check Node Conditions
Node conditions provide detailed health information:
# View node conditions
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}{"="}{.status}{"\t"}{end}{"\n"}{end}'
Key Conditions:
- Ready: Node is healthy and ready to accept pods (True = good)
- MemoryPressure: Node has insufficient memory (False = good)
- DiskPressure: Node has insufficient disk space (False = good)
- PIDPressure: Node has insufficient process IDs (False = good)
- NetworkUnavailable: Node network is not configured (False = good)
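To scan every node for problem conditions in one pass, one option is to pipe the same jsonpath output through grep and flag anything in an unexpected state (a sketch, not the only way to do this):
# Print only nodes reporting a bad condition; no output means all conditions look healthy
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}{"="}{.status}{"\t"}{end}{"\n"}{end}' | grep -E "MemoryPressure=True|DiskPressure=True|PIDPressure=True|NetworkUnavailable=True|Ready=(False|Unknown)"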
Using describe for detailed conditions:
kubectl describe node <node-name>
Look for the Conditions section:
Conditions:
Type Status Reason Message
---- ------ ------ -------
NetworkUnavailable False FlannelIsUp Flannel is running on this node
MemoryPressure False KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False KubeletHasSufficientPID kubelet has sufficient PID available
Ready True KubeletReady kubelet is posting ready status
3. Check Control Plane Components
# Check component statuses (deprecated since Kubernetes 1.19; kept for older clusters)
kubectl get componentstatuses
# Check control plane pods
kubectl get pods -n kube-system
# Check API server health
kubectl get --raw /healthz
# Check API server readiness
kubectl get --raw /readyz
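Both endpoints accept a verbose query parameter that breaks the result down into individual checks, and clusters on Kubernetes 1.16+ also expose /livez (/healthz is deprecated in favor of /livez and /readyz, though still widely available):
# Per-check breakdown of API server readiness
kubectl get --raw '/readyz?verbose'
# Liveness endpoint (Kubernetes 1.16+)
kubectl get --raw '/livez?verbose'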
Key Components:
- kube-apiserver: API server pods should be running
- kube-controller-manager: Controller manager should be running
- kube-scheduler: Scheduler should be running
- etcd: etcd pods should be running (if self-hosted)
Example:
kubectl get pods -n kube-system | grep -E "kube-apiserver|kube-controller|kube-scheduler|etcd"
4. Check kubelet Health
# Check kubelet status on a node
# (run the following on the node itself, e.g. over SSH)
systemctl status kubelet
# Check kubelet logs (on the node)
journalctl -u kubelet -n 50
# From within the cluster, check the node's kubelet conditions
kubectl describe node <node-name> | grep -A 5 "KubeletReady"
What to Check:
- kubelet service should be active and running
- No frequent restarts
- Logs should show successful node heartbeat
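With shell access to the node, you can also query the kubelet's own health endpoint; by default it binds to 127.0.0.1 on port 10248, though some distributions configure this differently:
# Query the kubelet's local healthz endpoint (run on the node); prints "ok" when healthy
curl http://127.0.0.1:10248/healthz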
5. Check Pod Distribution
# Count pods per node (with --all-namespaces, the NODE column is field 8)
kubectl get pods -o wide --all-namespaces | awk 'NR>1 {print $8}' | sort | uniq -c
# See pods that can't be scheduled
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# Check for CrashLoopBackOff pods
kubectl get pods --all-namespaces | grep CrashLoopBackOff
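High restart counts are another early warning sign. Sorting by restarts surfaces flapping pods that haven't hit CrashLoopBackOff yet (the sort key assumes the first container's status, which covers single-container pods):
# Sort all pods by the restart count of their first container
kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount'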
6. Check Resource Usage
# Node resource usage (requires metrics-server)
kubectl top nodes
# Pod resource usage
kubectl top pods --all-namespaces
# Node capacity vs allocatable
kubectl describe node <node-name> | grep -A 5 "Capacity\|Allocatable"
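Beyond live usage, it helps to compare what workloads have already requested against what the node can allocate; kubectl describe reports this in its "Allocated resources" section:
# Show requests and limits already committed on the node
kubectl describe node <node-name> | grep -A 9 "Allocated resources"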
Comprehensive Health Check Script
Create a simple script to check everything:
#!/bin/bash
echo "=== Node Status ==="
kubectl get nodes
echo -e "\n=== Node Conditions ==="
for node in $(kubectl get nodes -o name); do
echo "$node:"
kubectl get $node -o jsonpath='{range .status.conditions[*]}{.type}={.status} {end}' && echo
done
echo -e "\n=== Control Plane Pods ==="
kubectl get pods -n kube-system | grep -E "kube-apiserver|kube-controller|kube-scheduler|etcd"
echo -e "\n=== API Server Health ==="
kubectl get --raw /healthz && echo " OK" || echo " FAILED"
echo -e "\n=== Pending Pods ==="
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
echo -e "\n=== CrashLoopBackOff Pods ==="
kubectl get pods --all-namespaces | grep CrashLoopBackOff || echo "None"
echo -e "\n=== Node Resource Usage ==="
kubectl top nodes 2>/dev/null || echo "metrics-server not available"
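Save it under any name you like (cluster-health.sh is used here for illustration), make it executable, and run it:
chmod +x cluster-health.sh
./cluster-health.sh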
Common Issues and Solutions
Node NotReady
Symptoms:
NAME STATUS ROLES AGE VERSION
node-1 NotReady <none> 5d v1.28.0
Check:
kubectl describe node node-1
# Look for:
# - Network issues
# - kubelet not running
# - Resource pressure
Solutions:
- Check the kubelet service: systemctl status kubelet
- Check network connectivity
- Check resource pressure (memory, disk)
- Review kubelet logs
Memory/Disk Pressure
Symptoms:
- Node condition shows MemoryPressure=True or DiskPressure=True
- Pods can't be scheduled
- Existing pods might be evicted
Check:
kubectl describe node <node-name> | grep -A 2 "Pressure"
Solutions:
- Free up disk space
- Add more memory or nodes
- Clean up unused images: docker system prune (if using Docker; a containerd alternative follows below)
- Remove unused volumes
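On nodes that run containerd rather than Docker (the default on most current clusters), crictl offers an equivalent image cleanup; the --prune flag is available in recent crictl releases:
# Remove unused images on a containerd node (run on the node itself)
sudo crictl rmi --prune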
Control Plane Issues
Symptoms:
- API server not responding
- kubectl commands fail
- Components showing as unavailable
Check:
kubectl get pods -n kube-system
kubectl logs -n kube-system kube-apiserver-<node>
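For etcd on a kubeadm-managed cluster, you can exec into the etcd pod and query its health directly (the certificate paths below are the kubeadm defaults; adjust them for your setup):
# Check etcd health from inside the etcd pod (kubeadm default certificate paths)
kubectl -n kube-system exec etcd-<node> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health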
Solutions:
- Restart control plane components
- Check etcd health (if self-hosted)
- Verify network connectivity between components
- Check logs for errors
High Resource Usage
Symptoms:
- Node showing high CPU/memory usage
- Pods being throttled
- Slow application performance
Check:
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu
Solutions:
- Identify resource-heavy pods
- Add resource limits to pods (see the example after this list)
- Scale cluster horizontally
- Optimize application resource usage
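For the resource-limits step, kubectl can set requests and limits on an existing workload without editing YAML by hand (the deployment name and values below are placeholders, not recommendations):
# Set requests and limits on an existing deployment (name and values are placeholders)
kubectl set resources deployment <deployment-name> \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi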
Regular Health Check Schedule
Daily:
- Quick node status check
- Pending/CrashLoopBackOff pods
Weekly:
- Full cluster health check
- Resource usage review
- Control plane component status
Monthly:
- Capacity planning review
- Node condition deep dive
- Historical trend analysis
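The daily checks are easy to automate: a cron entry like the following could run the health-check script from earlier (the script path and log file are placeholders):
# Run the cluster health script every morning at 08:00 and append the output to a log
0 8 * * * /usr/local/bin/cluster-health.sh >> /var/log/cluster-health.log 2>&1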
Monitoring Tools
While manual checks are important, consider setting up:
- Prometheus + Grafana: Comprehensive metrics and alerting
- kubectl-who-can: Audit RBAC permissions
- kube-score: Static analysis of YAML files
- Popeye: Kubernetes cluster sanitizer
Key Takeaways
- Check nodes first - They're the foundation of your cluster
- Monitor conditions - They provide early warning signs
- Watch control plane - If it's unhealthy, nothing works
- Track resource usage - Prevent capacity issues
- Automate checks - Regular automated health checks catch issues early
A healthy cluster is the foundation for reliable applications. Regular health checks help you catch and fix issues before they impact your users!