# Monitor Pods & Resources
Monitoring your pods is essential for maintaining application reliability. Understanding what to watch and why helps you catch issues early and optimize resource usage.
## What to Monitor

### 1. Resource Usage (CPU & Memory)
Why it matters:
- Prevents resource exhaustion
- Identifies memory leaks
- Ensures proper resource allocation
- Detects performance bottlenecks
How to check:

```bash
# Current resource usage
kubectl top pods

# Per namespace
kubectl top pods -n production

# All namespaces
kubectl top pods --all-namespaces

# Sorted by CPU
kubectl top pods --sort-by=cpu

# Sorted by memory
kubectl top pods --sort-by=memory

# Specific pod
kubectl top pod <pod-name>
```
What to look for:
- Pods whose usage is close to their limits
- Pods with consistently high CPU (potential infinite loops)
- Memory usage trending upward (possible leaks)
- Pods with zero resource usage (might be idle)
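One quick way to spot the first item on that list is to put live usage next to the configured limits; a minimal sketch, assuming metrics-server is installed and using a hypothetical pod named my-app:

```bash
# Live per-container usage (placeholder pod name)
kubectl top pod my-app --containers

# Configured limits for the same containers
kubectl get pod my-app -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits}{"\n"}{end}'
```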
### 2. Restart Count
Why it matters:
- Indicates application instability
- Shows if health checks are failing
- Highlights configuration issues
- Reveals resource constraint problems
How to check:

```bash
# Pods with restarts
kubectl get pods

# Filter pods with restarts (RESTARTS is column 5 with --all-namespaces)
kubectl get pods --all-namespaces | awk 'NR==1 || $5>0'

# Watch restarts in real-time
kubectl get pods -w

# Describe to see restart reason
kubectl describe pod <pod-name>
```
What to look for:
- Restart count increasing (investigate immediately)
- Recent restarts (check events and logs)
- CrashLoopBackOff status
- Error or ImagePullBackOff status
Common causes:
- Application crashes
- Failed liveness probes
- Out of memory (OOMKilled)
- Configuration errors
- Missing dependencies
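To surface the worst offenders across the cluster, kubectl can sort by restart count directly; a small sketch (it sorts on the first container's status, so multi-container pods may need a second look):

```bash
# Highest restart counts appear last
kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount'
```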
### 3. Container State
Why it matters:
- Shows current pod health
- Identifies stuck containers
- Reveals startup issues
- Highlights readiness problems
How to check:

```bash
# Get pod status
kubectl get pods

# Detailed state information
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state}{"\n"}{end}'

# Describe for detailed state
kubectl describe pod <pod-name>
```
States to watch:
- Running: Normal operation
- Waiting: Container waiting to start (check the reason)
- Terminated: Container stopped (check the exit code)
- CrashLoopBackOff: Container repeatedly crashing (a Waiting reason)
- ImagePullBackOff: Can't pull the container image (a Waiting reason)
- ErrImagePull: Image pull failed (a Waiting reason)
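To list the waiting reason for every container in one pass, a jsonpath query along these lines works; a sketch (the reason column is blank for containers that are running):

```bash
# Namespace, pod, and waiting reason per pod
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
```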
### 4. CPU Throttling
Why it matters:
- Indicates CPU limits are too low
- Causes application slowdowns
- Leads to poor user experience
- Shows resource planning issues
How to check:

```bash
# metrics-server alone doesn't expose throttling;
# per-container throttling metrics require Prometheus or similar

# Check CPU limits
kubectl describe pod <pod-name> | grep -A 2 "Limits"

# Compare requests vs limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}'
```
What to look for:
- Pods with CPU usage at limit (being throttled)
- High latency during high CPU usage
- Application timeouts
- Request/limit mismatches
Signs of throttling:
- CPU usage hitting the limit
- Slow response times under load
- Requests timing out
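Without Prometheus you can still read the kernel's throttling counters from inside the container; a sketch, assuming the image ships cat and the node uses cgroup v2 (the cgroup v1 path differs):

```bash
# nr_throttled / nr_periods shows how often the container was throttled;
# throttled_usec is the total time spent throttled
kubectl exec my-app -- cat /sys/fs/cgroup/cpu.stat

# cgroup v1 equivalent on older nodes:
# kubectl exec my-app -- cat /sys/fs/cgroup/cpu/cpu.stat
```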
### 5. Memory Pressure
Why it matters:
- Can cause OOMKilled errors
- Leads to pod evictions
- Causes application instability
- Affects overall cluster health
How to check:

```bash
# Memory usage
kubectl top pods --sort-by=memory

# Memory requests and limits
kubectl describe pod <pod-name> | grep -A 5 "Requests\|Limits"

# Check for OOMKilled
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```
What to look for:
- Memory usage approaching limits
- OOMKilled in events
- Pods being evicted
- Memory spikes
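To find every pod whose container was last killed by the OOM killer, a jq scan over all pods helps; a sketch, assuming jq is installed:

```bash
# Lists namespace/pod where the last termination reason was OOMKilled
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled")
  | "\(.metadata.namespace)/\(.metadata.name)"'
```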
## Monitoring Commands Cheat Sheet

### Basic Monitoring
```bash
# All pods with resource usage
kubectl top pods --all-namespaces

# Pods by status
kubectl get pods --all-namespaces --field-selector=status.phase=Running
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl get pods --all-namespaces --field-selector=status.phase=Failed

# Watch pods in real-time
kubectl get pods -w --all-namespaces
```
### Detailed Analysis
```bash
# Pods with restarts (RESTARTS is column 5 with --all-namespaces)
kubectl get pods --all-namespaces | awk 'NR==1 || $5>0'

# Pods in CrashLoopBackOff
kubectl get pods --all-namespaces | grep CrashLoopBackOff

# Pods not ready
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded

# Resource requests vs usage
kubectl top pods --all-namespaces && kubectl get pods --all-namespaces -o json | jq '.items[] | {name: .metadata.name, requests: .spec.containers[].resources.requests, limits: .spec.containers[].resources.limits}'
```
### Event Monitoring
```bash
# Recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Events for specific pod
kubectl get events --field-selector involvedObject.name=<pod-name>

# Warning events only
kubectl get events --all-namespaces --field-selector type=Warning
```
## Setting Up Resource Requests and Limits

### Why They Matter
Requests:
- Reserve resources for the pod
- Used by scheduler to place pods
- Guaranteed minimum resources
Limits:
- Maximum resources pod can use
- Prevents one pod from consuming all resources
- Triggers throttling/OOMKilled when exceeded
### Example Configuration

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: my-app:latest
      resources:
        requests:
          memory: "128Mi"
          cpu: "100m"
        limits:
          memory: "256Mi"
          cpu: "200m"
```
### Right-Sizing Resources

Process:
1. Deploy without limits initially
2. Monitor actual usage over time
3. Set requests to average usage
4. Set limits to peak usage plus a 20-30% buffer
5. Monitor and adjust
```bash
# Monitor current usage
kubectl top pods --containers

# Check what's currently set
kubectl describe pod <pod-name> | grep -A 4 "Requests\|Limits"
```
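For step 2, a crude but dependency-free way to build a usage baseline is to append kubectl top samples to a file and review them later; a sketch, assuming metrics-server:

```bash
# Sample per-container usage every 5 minutes; stop with Ctrl-C
while true; do
  kubectl top pods --containers --no-headers >> usage.log
  sleep 300
done
```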
## Identifying Common Issues

### High Restart Count
Check:

```bash
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --field-selector involvedObject.name=<pod-name>
```
Common causes:
- Liveness probe too aggressive
- Application crash on startup
- Out of memory
- Missing configuration
### CPU Throttling
Symptoms:
- Slow response times
- Timeouts under load
- High latency
Solution:

```yaml
resources:
  limits:
    cpu: "500m"  # Increase the limit
```
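Rather than editing manifests by hand, kubectl set resources can apply the new limit to a Deployment directly; a sketch, with my-app and the container name app as placeholders:

```bash
# Raise the CPU limit on the "app" container of the my-app Deployment
kubectl set resources deployment my-app -c app --limits=cpu=500m
```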
### Memory Leaks
Symptoms:
- Memory usage gradually increasing
- Pods eventually OOMKilled
- Restarts don't help
Investigation:

```bash
# Monitor memory over time
watch -n 5 'kubectl top pod <pod-name>'

# Check for memory-related messages in application logs
kubectl logs <pod-name> | grep -i memory
```
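To capture the upward trend rather than a single reading, log timestamped memory samples; a sketch (my-app is a placeholder, metrics-server assumed):

```bash
# Append a timestamped memory reading every minute; plot or eyeball later
while true; do
  echo "$(date -u +%FT%TZ) $(kubectl top pod my-app --no-headers | awk '{print $3}')" >> mem.log
  sleep 60
done
```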
### Resource Starvation
Symptoms:
- Pods in Pending state
- "Insufficient cpu" or "Insufficient memory" events
- Pods can't be scheduled
Check:

```bash
kubectl describe node <node-name>
kubectl top nodes
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```
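The per-node view of what is already reserved lives in kubectl describe node under "Allocated resources"; pulling just that section makes the comparison quick:

```bash
# Requests and limits already allocated on each node vs. allocatable capacity
kubectl describe nodes | grep -A 8 "Allocated resources"
```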
## Best Practices

### 1. Always Set Resource Limits
```yaml
resources:
  requests:
    memory: "64Mi"
    cpu: "100m"
  limits:
    memory: "128Mi"
    cpu: "200m"
```
### 2. Monitor Continuously
Set up alerts for:
- Restart count > 3 in 5 minutes
- CPU usage > 80% of limit
- Memory usage > 90% of limit
- Pods not ready for > 5 minutes
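If you run Prometheus with kube-state-metrics, the first alert above maps onto a rule like this sketch (the metric name comes from kube-state-metrics; the threshold and window are illustrative):

```yaml
groups:
  - name: pod-alerts
    rules:
      - alert: PodRestartingTooOften
        # kube-state-metrics counter of container restarts
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```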
### 3. Use HorizontalPodAutoscaler
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
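For quick experiments, the same autoscaler can be created imperatively:

```bash
# Imperative equivalent of the manifest above
kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10
```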
### 4. Regular Review
- Weekly resource usage review
- Monthly right-sizing exercise
- Quarterly capacity planning
## Monitoring Tools

### Built-in (kubectl)

- kubectl top: Basic resource usage
- kubectl get pods: Status and restarts
- kubectl describe: Detailed information
- kubectl logs: Application logs
- kubectl get events: Cluster events
### Recommended Tools
- Prometheus + Grafana: Metrics and dashboards
- Datadog: Full observability platform
- New Relic: APM and infrastructure monitoring
- kubectl-cost: Cost analysis
## Key Metrics Summary
| Metric | Why Watch | What to Do |
|---|---|---|
| CPU Usage | Throttling, performance | Increase limits if hitting ceiling |
| Memory Usage | OOMKilled, evictions | Investigate leaks, adjust limits |
| Restart Count | Stability issues | Check logs, fix root cause |
| Container State | Health status | Investigate non-running states |
| Resource Limits | Capacity planning | Right-size based on usage |
## Quick Health Check Script
```bash
#!/bin/bash
echo "=== Resource Usage ==="
kubectl top pods --all-namespaces 2>/dev/null || echo "metrics-server not available"

echo -e "\n=== Pods with Restarts ==="
# RESTARTS is column 5 when --all-namespaces adds the NAMESPACE column
kubectl get pods --all-namespaces | awk 'NR==1 || $5>0'

echo -e "\n=== Problem Pods ==="
kubectl get pods --all-namespaces | grep -E "CrashLoopBackOff|Error|ImagePullBackOff"

echo -e "\n=== Pending Pods ==="
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```
Monitoring pods proactively helps you maintain reliable applications and catch issues before they impact users!