Readiness & Liveness Probes
Probes are Kubernetes' way of checking if your containers are healthy and ready to serve traffic. Properly configured probes prevent unnecessary restarts and ensure traffic only goes to ready pods.
Types of Probes
Liveness Probe
Purpose: Determines if the container is running properly.
What happens if it fails:
- The kubelet kills the container
- A new container instance is started, subject to the pod's restartPolicy
- Repeated failures lead to CrashLoopBackOff with increasing restart delays
Use when:
- Application can get into a broken state
- Restarting might fix the issue
- Deadlock or infinite loop scenarios
Readiness Probe
Purpose: Determines if the container is ready to accept traffic.
What happens if it fails:
- Pod is removed from Service endpoints
- Traffic stops being routed to this pod
- Container keeps running (doesn't restart)
Use when:
- Application needs time to start up
- Application temporarily can't serve traffic
- Database connections need to be established
Probe Mechanisms
HTTP Get Probe
Most common for web applications:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 3
Best practices:
- Use a dedicated health endpoint (not your main API)
- Keep the endpoint lightweight (no database queries)
- Return 200 for healthy; Kubernetes treats any status from 200 to 399 as success, so return a 5xx code when unhealthy
TCP Socket Probe
For non-HTTP services:
livenessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 30
  periodSeconds: 10
Use when:
- Database services
- Non-HTTP applications
- Services that don't expose HTTP endpoints
Exec Probe
Run a command inside the container:
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - "pgrep -f myapp || exit 1"
  initialDelaySeconds: 30
  periodSeconds: 10
Use sparingly:
- More expensive than HTTP/TCP
- Requires the probe command (and any shell it uses) to exist in the container image
- Harder to debug
Probe Parameters
initialDelaySeconds
Purpose: Wait time before first probe after container starts.
Example:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30  # Wait 30 seconds after the container starts
Best practices:
- Set based on actual startup time
- Add buffer (startup time + 10-20 seconds)
- Too low = premature failures
- Too high = slow detection of issues
periodSeconds
Purpose: How often to perform the probe.
Example:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10  # Check every 10 seconds
Best practices:
- Balance between detection speed and load
- Common: 5-30 seconds
- For liveness: Can be longer (10-30s)
- For readiness: Can be shorter (5-10s)
timeoutSeconds
Purpose: Time to wait for probe response.
Example:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 5  # Fail if no response within 5 seconds
Best practices:
- Should be less than periodSeconds
- Typical: 1-5 seconds
- Account for network latency
- Too low = false failures under load
successThreshold
Purpose: Number of consecutive successes needed to mark probe as successful.
Example:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  successThreshold: 1  # One success = healthy (default)
Best practices:
- Default is 1 (immediate); liveness and startup probes must keep it at 1
- For readiness probes, increase to 2-3 for flaky endpoints
- Helps avoid flapping
failureThreshold
Purpose: Number of consecutive failures before marking as failed.
Example:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 3  # Fail after 3 consecutive failures
Best practices:
- Default is 3
- Higher = more tolerant of temporary failures
- Lower = faster failure detection
- Balance between false positives and detection speed
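As a rule of thumb, worst-case detection time is roughly periodSeconds × failureThreshold: with periodSeconds: 10 and failureThreshold: 3, a broken container is acted on about 30 seconds after its probes start failing.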
Complete Example
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
    - name: app
      image: my-app:latest
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
Common Patterns
Pattern 1: Separate Endpoints
Best practice: Use different endpoints for liveness and readiness.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
Why:
- Liveness: Check if process is alive (simple)
- Readiness: Check if ready to serve (check dependencies)
Pattern 2: Startup Probe
Use for: Slow-starting applications.
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 30  # Allow up to 2.5 minutes (30 × 5s) to start
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
Why:
- Prevents liveness probe from killing slow-starting apps
- Once startup succeeds, liveness takes over
- Use for apps that take >30 seconds to start
Pattern 3: Conservative Readiness
Use for: Applications with external dependencies.
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 1   # Remove from Service endpoints after a single failure
  successThreshold: 2   # Require 2 consecutive successes before adding back
Why:
- Prevents sending traffic to unhealthy pods
- Avoids flapping (rapid add/remove cycles)
Avoiding Common Pitfalls
Pitfall 1: Probes Too Aggressive
Problem:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5  # Too short!
  periodSeconds: 1        # Too frequent!
Issues:
- App killed before fully started
- High load from probe checks
- False restarts
Solution:
- Increase initialDelaySeconds to actual startup time
- Use periodSeconds of 10+ seconds
- Add startupProbe for slow starters
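A corrected version of the probe above might look like this sketch (the timings are assumptions; base initialDelaySeconds on your app's measured startup time):

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30  # real startup time plus a buffer
  periodSeconds: 10        # a sustainable check rate
  failureThreshold: 3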
Pitfall 2: Heavy Health Checks
Problem:
# Health endpoint performs a database query
readinessProbe:
  httpGet:
    path: /health  # This handler queries the database!
    port: 8080
Issues:
- Database overload from health checks
- Slow responses cause probe failures
- Cascading failures
Solution:
- Health endpoint should be lightweight
- Check dependencies separately
- Cache health status if needed
Pitfall 3: Same Endpoint for Both
Problem:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /health  # Same endpoint!
    port: 8080
Issues:
- Can't differentiate between "dead" and "not ready"
- Inappropriate restarts
- Traffic routing issues
Solution:
- Use /health/live for liveness
- Use /health/ready for readiness
- Different logic in each endpoint
Pitfall 4: Wrong Failure Threshold
Problem:
livenessProbe:
  failureThreshold: 1  # Too sensitive!
Issues:
- Single network hiccup = restart
- False positives cause unnecessary restarts
- Application instability
Solution:
- Use failureThreshold: 3 (default)
- Increase for flaky networks
- Monitor probe failures to tune
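For example, a more tolerant version of the probe above (values are illustrative defaults, not tuned for any specific app) could be:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3  # tolerate a couple of transient failures (~30s) before restarting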
Health Endpoint Implementation
Example: Go Application
func healthLive(w http.ResponseWriter, r *http.Request) {
    // Simple check: is the process running?
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK"))
}

func healthReady(w http.ResponseWriter, r *http.Request) {
    // Check: can we serve traffic?
    // - Database connection OK?
    // - External dependencies available?
    if !db.IsConnected() {
        http.Error(w, "DB not connected", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK"))
}
Example: Node.js Application
// Liveness: simple process check
app.get('/health/live', (req, res) => {
  res.status(200).send('OK');
});

// Readiness: check dependencies
app.get('/health/ready', async (req, res) => {
  try {
    await db.ping();
    await redis.ping();
    res.status(200).send('OK');
  } catch (error) {
    res.status(503).send('Not ready');
  }
});
Testing Probes
Test Liveness Probe
# Create pod
kubectl apply -f pod-with-probes.yaml
# Watch pod status
kubectl get pods -w
# Check probe status
kubectl describe pod <pod-name> | grep -A 10 "Liveness\|Readiness"
# Force a probe failure (assumes your app exposes a toggle like this; otherwise stop the app process or block the port)
kubectl exec -it <pod-name> -- curl http://localhost:8080/health -X POST -d "fail=true"
Monitor Probe Failures
# View events
kubectl get events --field-selector involvedObject.name=<pod-name>
# Check for probe failures
kubectl describe pod <pod-name> | grep -i "probe\|unhealthy"
# View container logs
kubectl logs <pod-name>
Troubleshooting
Probe Failing Immediately
Check:
kubectl describe pod <pod-name>
kubectl logs <pod-name>
Common causes:
- Wrong path or port
- Application not listening on that port
- Firewall/network issues
- Application crash on startup
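For example, a port mismatch like the following (ports are illustrative) makes every probe fail even though the application is healthy:

ports:
  - containerPort: 8080  # the app listens here
readinessProbe:
  httpGet:
    path: /health/ready
    port: 9090           # the probe targets a port nothing listens on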
Probe Timing Out
Check:
# Test endpoint manually
kubectl exec -it <pod-name> -- curl http://localhost:8080/health
# Check response time (curl's built-in timing works even without a `time` binary in the image)
kubectl exec -it <pod-name> -- curl -o /dev/null -s -w '%{time_total}\n' http://localhost:8080/health
Solutions:
- Increase timeoutSeconds
- Optimize health endpoint
- Check application performance
Pod Flapping (Rapid Ready/NotReady)
Symptoms:
- Pod repeatedly becomes ready/unready
- Service endpoints changing constantly
Solutions:
- Increase successThreshold for readiness
- Fix flaky health endpoint
- Check for resource constraints
- Verify external dependencies
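A readiness probe tuned against flapping might look like this sketch (values are assumptions to adjust for your workload):

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  successThreshold: 2  # require two consecutive successes before re-adding the pod
  failureThreshold: 3  # tolerate brief blips before removing it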
Best Practices Summary
- ✅ Use separate endpoints for liveness and readiness
- ✅ Make health endpoints lightweight and fast
- ✅ Set appropriate initialDelaySeconds
- ✅ Use startupProbe for slow-starting apps
- ✅ Avoid heavy operations in probe endpoints
- ✅ Test probes under load
- ✅ Monitor probe failure rates
- ✅ Use failureThreshold > 1 to avoid false positives
- ✅ Tune periodSeconds based on application needs
- ✅ Document probe behavior in application code
Quick Reference
# Standard configuration
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
successThreshold: 2
Properly configured probes are essential for reliable Kubernetes applications. They ensure traffic only goes to healthy pods and that unhealthy containers are restarted when needed!