Troubleshooting Kubernetes Pods in Error State
In Kubernetes, a pod shows the Error status when one or more of its containers terminate with a non-zero exit code. This means the process inside the container failed, usually because the application crashed, hit an unhandled exception, or could not complete its intended task.
You can identify this state using:
kubectl get pods
NAME                  READY   STATUS   RESTARTS   AGE
data-processing-job   0/1     Error    0          2m
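In a busy namespace it can help to reduce that listing to the failing pods only. The following is a minimal sketch, assuming the default column layout shown above, where STATUS is the third column. The helper name error_pods is illustrative and not part of kubectl.

```shell
# Print the names of pods whose STATUS column is "Error".
# Reads `kubectl get pods` output on stdin and skips the header row.
error_pods() {
  awk 'NR > 1 && $3 == "Error" { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get pods | error_pods
```

With --all-namespaces the output gains a leading NAMESPACE column, so the column numbers in the awk expression would need to shift by one.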
Impact of Error State
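When a pod is in Error state, execution failed. Whether it is retried depends on the pod's restartPolicy and on what manages the pod: with restartPolicy: Always or OnFailure the kubelet restarts the failed container in place, while with restartPolicy: Never the pod stays in Error and a replacement is created only if a controller such as a Job manages it.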
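Whether the kubelet restarts a failed container in place is decided by the pod's restartPolicy field. That rule can be sketched as a small shell function, assuming the three policy values Kubernetes defines (Always, OnFailure, Never). The name will_restart is an illustrative helper, not a kubectl feature.

```shell
# Report whether the kubelet restarts a container in place,
# given the pod's restartPolicy ($1) and the container's exit code ($2).
will_restart() {
  case "$1" in
    Always)    echo yes ;;
    OnFailure) if [ "$2" -ne 0 ]; then echo yes; else echo no; fi ;;
    Never)     echo no ;;
    *)         echo "unknown restartPolicy: $1" >&2; return 2 ;;
  esac
}

# Read the policy of a real pod (needs cluster access):
#   kubectl get pod <pod-name> -o jsonpath='{.spec.restartPolicy}'
```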
Consequences:
- The task didn't complete successfully
- Data processing, backup, or script execution may be incomplete
- Dependent workloads (Jobs, CronJobs, Deployments) may fail or hang
- Application remains unavailable until fixed
Bottom line: Error state means Kubernetes started the container successfully, but the process inside it failed.
Common Causes and Solutions
1. Application Crash
Symptom: Runtime exception causes the main process to crash.
Diagnosis:
kubectl logs <pod-name>
kubectl logs <pod-name> --previous
kubectl describe pod <pod-name> | grep -A 5 "Exit Code:"
Solutions:
- Fix application bugs causing crashes
- Add proper error handling and logging
- Check application dependencies
- Review recent code changes
- Test application locally before deploying
2. Invalid Startup Command
Symptom: Container entrypoint or command fails immediately.
Diagnosis:
kubectl describe pod <pod-name> | grep -A 5 "Command:"
kubectl get pod <pod-name> -o yaml | grep -A 10 "command:"
Solutions:
- Verify entrypoint commands are correct
- Ensure command paths exist in container
- Check for syntax errors in command arrays
- Test commands locally with the same image
- Use absolute paths for commands
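Grepping YAML for "command:" can miss args or mix up containers in a multi-container pod. As an alternative sketch, kubectl's JSONPath output can print the command and args per container. The function name container_commands is an assumption made for this example.

```shell
# Print each container's name, command, and args on one tab-separated line.
# $1 = pod name. An empty command or args field means the image's own
# ENTRYPOINT or CMD is used for that part.
container_commands() {
  kubectl get pod "$1" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.command}{"\t"}{.args}{"\n"}{end}'
}

# Usage (needs cluster access):
#   container_commands <pod-name>
```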
3. Dependency Failure
Symptom: Container exits because it can't connect to an external dependency.
Diagnosis:
kubectl logs <pod-name> | grep -i "connection\|timeout\|refused\|unreachable"
kubectl get services,endpoints
Solutions:
- Ensure dependent services are running
- Check service DNS names and ports
- Verify network policies allow connections
- Add retry logic for dependencies
- Use init containers to wait for dependencies
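The retry advice above can also be applied in the container's entrypoint script or in an init container. The following is a minimal sketch of a retry wrapper in POSIX shell. The function name retry, the attempt count, and the db-service host in the usage line are placeholders, and the presence of nc in the image is an assumption.

```shell
# Run a command until it succeeds, up to a maximum number of attempts.
# $1 = max attempts, $2 = seconds to sleep between attempts, rest = command.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (hypothetical service name and port):
#   retry 30 2 nc -z db-service 5432 || exit 1
```

A fixed delay keeps the sketch short; a real entrypoint might prefer an increasing delay so a slow dependency is not probed too aggressively.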
4. Permission Issues
Symptom: Process fails due to missing permissions or non-existent files.
Diagnosis:
kubectl logs <pod-name> | grep -i "permission\|denied\|no such file"
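# kubectl exec needs a running container, so run this before the process exits or against a debug copy of the pod
kubectl exec <pod-name> -- ls -la /path/to/file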
Solutions:
- Fix file permissions in container image
- Ensure required files are present
- Configure securityContext with correct user
- Verify volume mounts are correct
- Check filesystem permissions
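Missing or unreadable files are easier to diagnose when the container reports them explicitly before the application starts. Below is a minimal sketch of such a startup check in POSIX shell. The function name preflight and the paths in the usage line are hypothetical.

```shell
# Fail fast with a clear message if a required path is missing or unreadable.
# Arguments: the paths the application needs at startup.
preflight() {
  rc=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "preflight: missing: $p" >&2
      rc=1
    elif [ ! -r "$p" ]; then
      echo "preflight: not readable by uid $(id -u): $p" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage at the top of an entrypoint script (placeholder paths):
#   preflight /config/app.yaml /data || exit 1
```

Printing the uid makes it easier to compare against the user configured in securityContext.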
5. Exit Code Handling
Symptom: Container command explicitly returns a failure exit code.
Diagnosis:
kubectl describe pod <pod-name> | grep "Exit Code:"
# Exit codes other than 0 indicate failure
Solutions:
- Fix the underlying issue causing non-zero exit
- Review application logic for error handling
- Check script exit codes
- Ensure processes return 0 on success
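One frequent source of misleading exit codes is a wrapper script in which a later command hides an earlier failure. The sketch below shows one way to stop at the first failing step and pass its exit code through. The function name run_steps and the step names in the usage line are illustrative.

```shell
# Run each step in order and stop at the first failure,
# returning that step's exit code so it becomes the container's exit code.
run_steps() {
  for step in "$@"; do
    rc=0
    "$step" || rc=$?
    if [ "$rc" -ne 0 ]; then
      echo "step '$step' failed with exit code $rc" >&2
      return "$rc"
    fi
  done
  return 0
}

# Usage in an entrypoint script (step names stand for your own functions):
#   run_steps download_input transform upload_result || exit $?
```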
Step-by-Step Troubleshooting
Step 1: Check Exit Code
kubectl describe pod <pod-name>
# Look for Exit Code under State: Terminated (or under Last State if the container has since restarted)
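If you only need the exit code, JSONPath output avoids scanning the full describe text. This is a sketch under the assumption that the container has terminated and not yet restarted. The function name pod_exit_codes is made up for this example.

```shell
# Print "<container name><TAB><exit code>" for each container in the pod.
# The exit code field is empty for containers that have not terminated.
pod_exit_codes() {
  kubectl get pod "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.terminated.exitCode}{"\n"}{end}'
}

# Usage (needs cluster access):
#   pod_exit_codes <pod-name>
```

For a container that has already restarted, the previous run's code is under .lastState.terminated.exitCode instead.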
Step 2: Examine Logs
kubectl logs <pod-name>
kubectl logs <pod-name> --previous
kubectl logs <pod-name> --all-containers=true
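The --previous flag returns an error when the container has never restarted, which is the usual case for a pod sitting in Error with zero restarts. A small wrapper can try the previous instance first and fall back to the current one. The name crash_logs is an assumption for this sketch.

```shell
# Show logs from the previous container instance if there is one,
# otherwise show logs from the current (or only) instance.
crash_logs() {
  kubectl logs "$1" --previous 2>/dev/null || kubectl logs "$1"
}

# Usage (needs cluster access):
#   crash_logs <pod-name>
```

For multi-container pods, a -c <container-name> argument would need to be added to both calls.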
Step 3: Review Pod Events
kubectl describe pod <pod-name> | grep -A 10 "Events:"
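Events can also be queried directly, which is useful when the describe output is long. The sketch below filters events by the pod's name and sorts them by creation time. The function name pod_events is illustrative.

```shell
# List events that reference the given pod, oldest first.
pod_events() {
  kubectl get events \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.metadata.creationTimestamp
}

# Usage (needs cluster access; add -n <namespace> if the pod is not in the current namespace):
#   pod_events <pod-name>
```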
Step 4: Check Container Configuration
kubectl get pod <pod-name> -o yaml
# Review command, args, env, volumes
Step 5: Test Locally
# Run the same image locally to reproduce
docker run <image-name> <command>
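When reproducing locally, the exit code is the value to compare against what Kubernetes reported. A minimal sketch that prints the exit code of whatever command it wraps follows. The helper name run_and_report is an assumption, and the --rm flag in the usage line only removes the stopped container afterwards.

```shell
# Run a command, print its exit code, and return that same code.
run_and_report() {
  rc=0
  "$@" || rc=$?
  echo "exit code: $rc"
  return "$rc"
}

# Usage (image and command are the placeholders from this step):
#   run_and_report docker run --rm <image-name> <command>
```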
Quick Fixes
Immediate Actions
Check logs first: Most errors are visible in logs
kubectl logs <pod-name>
Restart pod: If it's managed by a controller
kubectl delete pod <pod-name>
Fix configuration: Update deployment with corrected settings
Add debugging: Temporarily use a shell to debug
command: ["/bin/sh"]
args: ["-c", "while true; do sleep 3600; done"]
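An alternative to editing the manifest is to let kubectl debug create a copy of the failed pod with the container's command replaced by a shell. This sketch assumes a kubectl and cluster version that support kubectl debug and an image that contains sh. The function name debug_copy and the -debug suffix are choices made for this example.

```shell
# Start an interactive shell in a copy of the pod.
# $1 = pod name, $2 = name of the container whose command is replaced by sh.
debug_copy() {
  kubectl debug "$1" -it --copy-to="$1-debug" --container="$2" -- sh
}

# Usage (needs cluster access):
#   debug_copy <pod-name> <container-name>
# Clean up afterwards:
#   kubectl delete pod <pod-name>-debug
```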
Best Practices to Prevent Error State
- Proper error handling: Add try-catch blocks and error logging
- Validate configuration: Check config at startup
- Health checks: Implement liveness and readiness probes
- Graceful shutdown: Handle SIGTERM properly
- Dependency checks: Verify dependencies are available
- Testing: Test containers locally before deploying
- Monitoring: Set up alerts for error states
Exit Code Reference
Common exit codes and meanings:
- 0: Success
- 1: General error
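- 2: Misuse of a shell builtin (shell convention)
- 126: Command found but cannot be executed (for example, missing execute permission)
- 127: Command not found
- 128+N: Process terminated by signal N
- 130: Process terminated by SIGINT (128+2, Ctrl+C)
- 137: Process killed by SIGKILL (128+9), often by the out-of-memory killer
- 143: Process terminated by SIGTERM (128+15)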
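The 128+N rule in this list can be applied mechanically. The following sketch turns an exit code into a short description, following the common shell conventions listed above. The function name describe_exit_code is illustrative, and applications are free to assign their own meanings to codes below 126.

```shell
# Translate a container exit code into a short description.
describe_exit_code() {
  code=$1
  case "$code" in
    0)   echo "success" ;;
    1)   echo "general error" ;;
    2)   echo "misuse of shell builtin" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
        sig=$((code - 128))
        # `kill -l N` prints the name of signal number N where the shell knows it
        name=$(kill -l "$sig" 2>/dev/null || echo unknown)
        echo "terminated by signal $sig ($name)"
      else
        echo "application-defined error"
      fi
      ;;
  esac
}

# Usage:
#   describe_exit_code 137
```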
Conclusion
Pods in Error state indicate application or configuration failures. Start by checking logs with kubectl logs, then examine the exit code and pod events. Most issues can be resolved by fixing the root cause identified in the logs.
Remember: Error state means the container started but the process failed. Always check logs first!