Troubleshooting Kubernetes Pods Killed by OOM (Out of Memory)

When a container in a Kubernetes pod enters the OOMKilled state, the kernel's Out-Of-Memory (OOM) killer terminated it because the container exceeded its memory limit. This is a runtime failure: the container started successfully but then used more memory than it was allowed.

You can identify this state using:

kubectl get pods

NAME               READY   STATUS      RESTARTS   AGE
my-app-pod         0/1     OOMKilled   3          4m

Confirm with:

kubectl describe pod <pod-name>
# Look for:
# State:          Terminated
# Reason:         OOMKilled
# Exit Code:      137

Impact of OOMKilled State

  • Container is terminated abruptly, and Kubernetes may restart it repeatedly
  • Application becomes unavailable or behaves unpredictably
  • Deployment rollouts may hang due to constant restarts
  • Cluster resources are wasted if the pod keeps restarting
  • Continuous OOMKills can impact node stability and cause other pods to be evicted

Bottom line: OOMKilled means your application needs more memory or is leaking memory and must be tuned or fixed.

Common Causes and Solutions

1. Memory Limit Too Low

Symptom: Container's memory limit is lower than what the application actually requires.

Diagnosis:

kubectl describe pod <pod-name> | grep -A 5 "Limits:"
kubectl top pod <pod-name>

Solutions:

  • Increase the memory limit in the Deployment (see the sketch after this list)
  • Monitor actual memory usage over time
  • Set limits based on peak usage, not average
  • Consider setting limits 20-30% higher than peak observed usage
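
For example, raising the limit in the Deployment's container spec might look like the fragment below. This is a sketch only: the container name "my-app" and the values are placeholders to adapt to your workload.

spec:
  template:
    spec:
      containers:
      - name: my-app
        resources:
          requests:
            memory: "512Mi"
          limits:
            memory: "768Mi"  # ~20-30% above the peak observed in monitoring

Apply the change with kubectl apply and verify the rollout completes before treating the issue as resolved.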

2. Memory Leaks

Symptom: Application gradually consumes all available memory due to poor memory management.

Diagnosis:

kubectl logs <pod-name> | grep -i "memory\|outofmemory\|heap"
kubectl top pod <pod-name> --containers
# Monitor memory usage over time - if it keeps increasing, there's a leak
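
A simple way to confirm growth over time is to sample usage on an interval. The loop below is a minimal sketch; it assumes metrics-server is installed (kubectl top depends on it) and that <pod-name> is replaced with your pod.

# Sample container memory every 60 seconds; a steadily climbing value suggests a leak
while true; do
  date
  kubectl top pod <pod-name> --containers
  sleep 60
done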

Solutions:

  • Fix application memory leaks in code
  • Implement proper resource cleanup
  • Use memory profiling tools (heap analyzers)
  • Restart pods periodically if leaks can't be fixed immediately
  • Consider implementing memory limits with automatic restarts

3. Unexpected Load or Traffic Spikes

Symptom: Increased usage causes application to exceed memory allocation.

Diagnosis:

kubectl top pod <pod-name>
kubectl get events --field-selector involvedObject.name=<pod-name> | grep -i oom

Solutions:

  • Increase memory limits for traffic spikes
  • Implement horizontal pod autoscaling (see the sketch after this list)
  • Add memory buffers for peak loads
  • Use request throttling to limit memory usage
  • Monitor and set alerts for memory usage patterns
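
A memory-based HorizontalPodAutoscaler is one way to absorb spikes. The manifest below is a sketch: it assumes metrics-server is installed, the Deployment name "my-app" is a placeholder, and the replica counts and 75% target are illustrative.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75

Note that scaling out only helps when per-pod memory actually falls as load is spread across replicas; it does nothing for a leak.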

4. JVM Heap Misconfiguration

Symptom: For Java apps, JVM heap settings are too close to container memory limit.

Diagnosis:

kubectl exec <pod-name> -- env | grep -i "jvm\|heap\|xmx"
kubectl describe pod <pod-name> | grep -i memory

Solutions:

  • Set JVM heap size lower than container memory limit
  • Leave headroom for JVM overhead (usually 20-25% of limit)
  • Configure -XX:MaxRAMPercentage instead of fixed heap sizes
  • Example: For 512Mi limit, use max heap of ~384Mi
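
To verify which heap size the JVM actually selected inside the container, you can print its final flags. This assumes a java binary is on the container's PATH; MaxHeapSize is reported in bytes.

kubectl exec <pod-name> -- java -XX:+PrintFlagsFinal -version | grep -i maxheapsize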

5. Shared Memory or Caching Issues

Symptom: In-memory caching, shared memory, or tmpfs volumes consume unaccounted memory.

Diagnosis:

kubectl exec <pod-name> -- df -h
kubectl exec <pod-name> -- cat /proc/meminfo | grep -i shmem

Solutions:

  • Account for tmpfs mounts in memory limits (see the sketch after this list)
  • Reduce cache sizes in applications
  • Use Redis or external cache instead of in-memory
  • Monitor /dev/shm usage if using shared memory
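
Files written to a memory-backed emptyDir (tmpfs) are charged against the container's memory limit, so cap the volume explicitly. The pod spec fragment below is a sketch with placeholder names and sizes.

containers:
- name: my-app
  volumeMounts:
  - name: shm
    mountPath: /dev/shm
volumes:
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 128Mi  # budget this inside the container's memory limit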

Step-by-Step Troubleshooting

Step 1: Confirm OOMKilled

kubectl describe pod <pod-name> | grep -A 10 "Last State"
# Should show: Reason: OOMKilled, Exit Code: 137
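
The same fields can also be read directly from the pod status with jsonpath (the first container in the pod is assumed here):

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'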

Step 2: Check Current Memory Usage

kubectl top pod <pod-name>
kubectl top pod <pod-name> --containers

Step 3: Review Memory Limits

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources.limits.memory}'

Step 4: Analyze Memory Patterns

# Check if memory grows over time (memory leak indicator)
kubectl top pod <pod-name> --containers
# Run this multiple times and observe growth

Step 5: Check Container Logs

kubectl logs <pod-name>
kubectl logs <pod-name> --previous
# Look for memory-related errors or warnings

Quick Fixes

Immediate Actions

  1. Increase memory limit temporarily (a one-line alternative is shown after this list):

    resources:
      limits:
        memory: "1Gi"  # Increase from current limit
    
  2. Reduce memory pressure:

    • Scale down other pods on the same node
    • Evict low-priority pods
    • Add more nodes to cluster
  3. Implement restart policy:

    restartPolicy: OnFailure  # For jobs
    # Or let deployment handle restarts
    
  4. Add memory requests (if missing):

    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"
    
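If you cannot edit the manifest immediately, quick fix 1 can also be applied imperatively. The command below is a sketch with placeholder names and values; note that it triggers a new rollout.

kubectl set resources deployment <deployment-name> -c <container-name> --limits=memory=1Gi --requests=memory=512Mi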

Best Practices to Prevent OOMKilled

  1. Set appropriate memory limits: Based on actual usage patterns, not guesses
  2. Monitor memory usage: Use tools like Prometheus to track memory over time
  3. Implement memory requests: Help scheduler make better placement decisions
  4. Fix memory leaks: Address application bugs causing gradual memory growth
  5. Right-size resources: Regularly review and adjust limits based on metrics
  6. Use memory-aware languages: For Java, configure heap properly
  7. Implement graceful degradation: Reduce functionality under memory pressure
  8. Set up alerts: Monitor for memory usage approaching limits
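
As one possible implementation of points 2 and 8, assuming the Prometheus Operator with kube-state-metrics and cAdvisor metrics are available, an alert that fires when a container's working set approaches its limit could look like this sketch (rule name, threshold, and labels are placeholders):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: container-memory-near-limit
spec:
  groups:
  - name: memory
    rules:
    - alert: ContainerMemoryNearLimit
      expr: |
        max by (namespace, pod, container) (container_memory_working_set_bytes{container!="",container!="POD"})
          / on (namespace, pod, container)
        kube_pod_container_resource_limits{resource="memory"} > 0.9
      for: 10m
      labels:
        severity: warning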

Resource Configuration Examples

Correct Memory Configuration

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"  # Leave 20-30% buffer above peak usage
    cpu: "500m"

Java Application Configuration

env:
- name: JAVA_OPTS  # only takes effect if the image's entrypoint passes it to java; JAVA_TOOL_OPTIONS is read by the JVM automatically
  value: "-XX:MaxRAMPercentage=75.0"  # or a fixed "-Xmx384m"; -Xmx overrides MaxRAMPercentage if both are set
resources:
  limits:
    memory: "512Mi"  # heap (~384Mi at 75% of the limit) + JVM overhead (~128Mi)

Conclusion

OOMKilled pods are usually caused by insufficient memory limits, memory leaks, or unexpected load. Start by increasing memory limits temporarily, then investigate the root cause. Monitor memory usage patterns and set limits based on actual peak usage with appropriate buffers.

Remember: Exit code 137 typically indicates OOMKilled (128 + 9, where 9 is SIGKILL).