Troubleshooting Kubernetes Pods in Error State

In Kubernetes, a pod enters the Error state when one or more of its containers stop running with a non-zero exit code. This indicates that the process inside the container failed unexpectedly, usually when an application crashes, encounters an exception, or cannot complete its intended task.

You can identify this state using:

kubectl get pods

NAME                  READY   STATUS    RESTARTS   AGE
data-processing-job   0/1     Error     0          2m

Impact of Error State

When a pod is in Error state, execution failed. Kubernetes won't automatically retry unless a controller (Job, Deployment, ReplicaSet) manages it.

Consequences:

The task didn't complete successfully
Data processing, backup, or script execution may be incomplete
Dependent workloads (Jobs, CronJobs, Deployments) may fail or hang
Application remains unavailable until fixed

Bottom line: Error state means Kubernetes started the container successfully, but the process inside it failed.

Common Causes and Solutions

1. Application Crash

Symptom: Runtime exception causes the main process to crash.

Diagnosis:

kubectl logs <pod-name>
kubectl logs <pod-name> --previous
kubectl describe pod <pod-name> | grep -A 5 "Exit Code:"

Solutions:

Fix application bugs causing crashes
Add proper error handling and logging
Check application dependencies
Review recent code changes
Test application locally before deploying

2. Invalid Startup Command

Symptom: Container entrypoint or command fails immediately.

Diagnosis:

kubectl describe pod <pod-name> | grep -A 5 "Command:"
kubectl get pod <pod-name> -o yaml | grep -A 10 "command:"

Solutions:

Verify entrypoint commands are correct
Ensure command paths exist in container
Check for syntax errors in command arrays
Test commands locally with the same image
Use absolute paths for commands

3. Dependency Failure

Symptom: Container exits because it can't connect to external dependency.

Diagnosis:

kubectl logs <pod-name> | grep -i "connection\|timeout\|refused\|unreachable"
kubectl get services,endpoints

Solutions:

Ensure dependent services are running
Check service DNS names and ports
Verify network policies allow connections
Add retry logic for dependencies
Use init containers to wait for dependencies

4. Permission Issues

Symptom: Process fails due to missing permissions or non-existent files.

Diagnosis:

kubectl logs <pod-name> | grep -i "permission\|denied\|no such file"
kubectl exec <pod-name> -- ls -la /path/to/file

Solutions:

Fix file permissions in container image
Ensure required files are present
Configure securityContext with correct user
Verify volume mounts are correct
Check filesystem permissions

5. Exit Code Handling

Symptom: Container command explicitly returns a failure exit code.

Diagnosis:

kubectl describe pod <pod-name> | grep "Exit Code:"
# Exit codes other than 0 indicate failure

Solutions:

Fix the underlying issue causing non-zero exit
Review application logic for error handling
Check script exit codes
Ensure processes return 0 on success

Step-by-Step Troubleshooting

Step 1: Check Exit Code

kubectl describe pod <pod-name>
# Look for Exit Code in Last State section

Step 2: Examine Logs

kubectl logs <pod-name>
kubectl logs <pod-name> --previous
kubectl logs <pod-name> --all-containers=true

Step 3: Review Pod Events

kubectl describe pod <pod-name> | grep -A 10 "Events:"

Step 4: Check Container Configuration

kubectl get pod <pod-name> -o yaml
# Review command, args, env, volumes

Step 5: Test Locally

# Run the same image locally to reproduce
docker run <image-name> <command>

Quick Fixes

Immediate Actions

Check logs first: Most errors are visible in logs
```
kubectl logs <pod-name>
```
Restart pod: If it's managed by a controller
```
kubectl delete pod <pod-name>
```
Fix configuration: Update deployment with corrected settings

Add debugging: Temporarily use a shell to debug

command: ["/bin/sh"]
args: ["-c", "while true; do sleep 3600; done"]

Best Practices to Prevent Error State

Proper error handling: Add try-catch blocks and error logging
Validate configuration: Check config at startup
Health checks: Implement liveness and readiness probes
Graceful shutdown: Handle SIGTERM properly
Dependency checks: Verify dependencies are available
Testing: Test containers locally before deploying
Monitoring: Set up alerts for error states

Exit Code Reference

Common exit codes and meanings:

0: Success
1: General error
2: Misuse of shell command
126: Command cannot execute
127: Command not found
128+N: Process terminated by signal N
130: Process terminated by SIGINT (Ctrl+C)
137: Process killed (SIGKILL)

Related Resources

Conclusion

Pods in Error state indicate application or configuration failures. Start by checking logs with kubectl logs, then examine the exit code and pod events. Most issues can be resolved by fixing the root cause identified in the logs.

Remember: Error state means the container started but the process failed. Always check logs first!