Smart Alerting & Notifications
Good alerting tells you about real incidents before your users do. Bad alerting keeps you awake with false alarms. This guide shows you how to create actionable alerts that actually help instead of overwhelming you.
Alerting Principles
1. Alert on Symptoms, Not Causes
Bad: Alert on "high CPU usage"
Good: Alert on "response time > 500ms for 5 minutes"
Why:
- Symptoms tell you users are affected
- Causes might not impact users
- Focus on what matters
2. Alert on What You Can Action
Bad: Alert on "node disk usage > 80%"
Good: Alert on "node disk usage > 85% for 10 minutes" (see the sketch after this list)
Why:
- Temporary spikes are noise
- Sustained issues need attention
- Actionable alerts = clear response
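A minimal sketch of the "Good" rule above, assuming node_exporter filesystem metrics (node_filesystem_avail_bytes, node_filesystem_size_bytes) are being scraped; the fstype filter and threshold are placeholders to tune for your environment:
- alert: NodeDiskUsageHigh
  # Assumes node_exporter metrics; adjust the fstype/mountpoint filters for your nodes
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
       / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.85
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Disk usage above 85% on {{ $labels.instance }} ({{ $labels.mountpoint }})"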
3. Use Alert Severity Appropriately
Critical: Users are affected, action needed immediately
Warning: Potential issue, investigate soon
Info: Notable event, no immediate action
4. Reduce Noise
- Group related alerts
- Suppress during known maintenance
- Use alert fatigue thresholds
- Route non-critical to different channels
What to Alert On
Application-Level Alerts
High Priority:
- Error rate spike (> 1% for 5 minutes)
- Response time degradation (p95 > threshold)
- Availability drop (health checks failing)
- Critical business metrics (orders failing, payments failing)
Medium Priority:
- Warning rate increase
- Resource usage approaching limits
- Dependency health issues
Infrastructure Alerts
High Priority:
- Pods crash looping
- Nodes not ready
- Control plane components down
- Out of capacity (pods can't be scheduled; see the sketch after this list)
Medium Priority:
- High resource usage
- Disk space warning
- Network issues
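To catch the out-of-capacity case, one option is to alert on pods stuck in Pending. A rough sketch, assuming kube-state-metrics is installed (kube_pod_status_phase comes from it); the window is an example to tune:
- alert: PodsPendingTooLong
  # Assumes kube-state-metrics; pods stuck in Pending usually mean the scheduler can't place them
  expr: sum(kube_pod_status_phase{phase="Pending"}) by (namespace) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Pods in namespace {{ $labels.namespace }} have been Pending for 15+ minutes"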
Business-Critical Alerts
High Priority:
- Revenue-generating features down
- Customer-facing services unavailable
- Security incidents
- Data loss risks
Alert Examples
Example 1: Pod CrashLoopBackOff
Alert:
groups:
  - name: kubernetes.pods
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting (restart rate {{ $value }}/s over the last 15 minutes)"
Why it works:
- Only alerts on sustained crashes (5 minutes)
- Includes context (namespace, pod name)
- Critical severity = immediate attention
Example 2: High Error Rate
Alert:
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
    sum(rate(http_requests_total[5m])) by (service)
      > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate for {{ $labels.service }}"
    description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"
Why it works:
- Percentage based (scales with traffic)
- 5-minute window prevents false positives
- Warning level (investigate, not critical yet)
Example 3: Node Not Ready
Alert:
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} is not ready"
    description: "Node {{ $labels.node }} has been not ready for more than 2 minutes"
Why it works:
- Short duration (transient NotReady flaps clear quickly; anything longer needs attention)
- Critical (affects pod scheduling)
- Clear description
Example 4: Response Time Degradation
Alert:
- alert: HighResponseTime
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High response time for {{ $labels.service }}"
    description: "95th percentile response time is {{ $value }}s for {{ $labels.service }}"
Why it works:
- Uses a percentile (p95) - a few extreme outliers won't trigger it, but broad slowdowns will
- 10-minute window - sustained performance issue
- Warning level - investigate before it becomes critical
Alert Routing
Route by Severity
Critical Alerts:
- Slack: #critical-alerts (with @here)
- PagerDuty: Immediate escalation
- Email: Sent immediately
Warning Alerts:
- Slack: #alerts channel
- PagerDuty: Low urgency
- Email: Daily digest
Info Alerts:
- Slack: #monitoring channel
- No PagerDuty
- Email: Weekly digest
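A sketch of what the critical path can look like on the Alertmanager side, assuming a PagerDuty Events API v2 integration key and a Slack webhook (the receiver name and keys are placeholders):
receivers:
  - name: 'critical-team'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'   # Events API v2 integration key
        severity: 'critical'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#critical-alerts'
        send_resolved: true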
Route by Team
- Frontend team → Frontend alerts
- Backend team → Backend alerts
- Infrastructure team → Cluster alerts
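One way to wire this up is to put a team label on every alert rule and branch on it in the Alertmanager routing tree. The team values and receiver names below are assumptions; use whatever matches your org:
route:
  receiver: 'default-receiver'
  routes:
    - match:
        team: frontend
      receiver: 'frontend-team'
    - match:
        team: backend
      receiver: 'backend-team'
    - match:
        team: infra
      receiver: 'infra-team'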
Route by Service
- Critical services → Immediate notification
- Non-critical services → Standard routing
- Development → Dev channel only
Reducing Alert Noise
1. Use Alert Grouping
Group related alerts to reduce noise:
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10m
  repeat_interval: 12h
Benefits:
- Multiple pod failures = one alert group
- Reduces notification spam
- Easier to triage
2. Inhibit Duplicates, Silence Maintenance
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster']
Why:
- If a critical alert fires, matching warnings are suppressed
- Reduces duplicate alerts for the same underlying issue
- For planned maintenance windows, use a silence instead of an inhibit rule (see the sketch below)
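For maintenance, a silence is usually the better tool. A sketch using amtool, assuming it is pointed at your Alertmanager; the matcher and window are examples:
# Silence everything for the payments service during a 2-hour maintenance window
amtool silence add service=payments \
  --alertmanager.url=http://alertmanager:9093 \
  --duration=2h \
  --author="ops" \
  --comment="Planned database maintenance"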
3. Alert Fatigue Thresholds
Don't alert on every single event:
- alert: PodRestart
  # Don't fire on a single restart; only alert if there are more than 3 restarts in an hour
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
4. Time-Based Routing
Route alerts based on time:
routes:
  - match:
      severity: warning
    receiver: 'warning-team'
    # Only deliver this route during business hours (see the interval definition below)
    active_time_intervals: ['business-hours']
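The business-hours interval referenced above has to be defined at the top level of the Alertmanager config; this needs a reasonably recent Alertmanager (active_time_intervals landed in v0.24). A sketch:
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'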
Alert Best Practices
1. Include Context
Bad:
Alert: High CPU usage
Good:
Alert: High CPU usage on pod web-app-abc123
Namespace: production
CPU: 850m / 1000m (85%)
Duration: 10 minutes
Node: worker-1
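One way to carry that context automatically is through templated annotations on the rule itself; the label names below depend on the metric you alert on, so treat them as placeholders:
annotations:
  summary: "High CPU usage on pod {{ $labels.pod }}"
  description: >-
    Namespace {{ $labels.namespace }}, node {{ $labels.node }}:
    CPU at {{ $value | humanizePercentage }} of limit for the last 10 minutes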
2. Include Runbooks
Every alert should link to a runbook:
annotations:
  summary: "Pod crash looping"
  description: "Pod {{ $labels.pod }} is crash looping"
  runbook_url: "https://wiki.company.com/runbooks/pod-crashloop"
3. Test Alerts
# Check rule syntax
promtool check rules alert-rules.yml
# Run rule unit tests (takes a test file, not the rule file - see the sketch below)
promtool test rules alert-rules-test.yml
# Send a test alert to Alertmanager (v2 API)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"}
  }]'
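A minimal sketch of what alert-rules-test.yml could look like for the PodCrashLooping rule from Example 1; the file name and the input series values are made up for illustration:
rule_files:
  - alert-rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # A pod restarting roughly once a minute for 20 minutes
      - series: 'kube_pod_container_status_restarts_total{pod="web-app-abc123", namespace="production"}'
        values: '0+1x20'
    alert_rule_test:
      - eval_time: 10m
        alertname: PodCrashLooping
        exp_alerts:
          - exp_labels:
              severity: critical
              pod: web-app-abc123
              namespace: production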
4. Review and Tune
Regular review:
- Weekly: Review alert frequency
- Monthly: Remove unused alerts
- Quarterly: Review alert thresholds
Metrics to track:
- Alert volume
- False positive rate
- Mean time to acknowledge
- Mean time to resolve
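Prometheus records firing alerts in the built-in ALERTS series, so the volume side of this review can come straight from a query; the 7-day window is just an example:
# Currently firing alerts, per alert name
count(ALERTS{alertstate="firing"}) by (alertname)
# Rough noisiness over the last week (samples where each alert was firing)
sort_desc(sum(count_over_time(ALERTS{alertstate="firing"}[7d])) by (alertname))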
Practical Alert Configuration
Prometheus AlertManager Config
global:
  resolve_timeout: 5m

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10m
  repeat_interval: 12h
  routes:
    # Critical alerts - immediate, and keep evaluating further routes
    - match:
        severity: critical
      receiver: 'critical-team'
      continue: true
    # Warning alerts - delivered during business hours only (Alertmanager v0.24+)
    - match:
        severity: warning
      receiver: 'warning-team'
      active_time_intervals: ['business-hours']

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster']

receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#monitoring'
  - name: 'critical-team'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#critical-alerts'
        title: '🚨 Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'warning-team'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#alerts'
        title: '⚠️ Warning'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
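Before reloading Alertmanager with a config like this, it's worth validating it; amtool ships with a check-config command for exactly that:
amtool check-config alertmanager.yml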
Common Alert Patterns
Pattern 1: Rate-Based Alerts
# Alert when 5xx responses exceed 10 per second
expr: rate(http_requests_total{status=~"5.."}[5m]) > 10
Pattern 2: Threshold Alerts
# Alert when memory usage approaches the container limit (only meaningful when a limit is set)
expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
Pattern 3: Absence Alerts
# Alert when a scrape target is down
expr: up{job="my-service"} == 0
# Or, when the metric has disappeared entirely (no targets left at all):
# expr: absent(up{job="my-service"})
Pattern 4: Comparison Alerts
# Alert when the last hour deviates more than 20% from the 6-hour average
expr: |
  (avg_over_time(metric[1h]) - avg_over_time(metric[6h]))
    / avg_over_time(metric[6h]) > 0.2
Monitoring Alert Health
Track Alert Metrics
- Alert volume over time
- Alert resolution time
- False positive rate
- Alert acknowledgment time
Alert on Alerts
# Alert if too many alerts firing
- alert: TooManyAlerts
  expr: count(ALERTS{alertstate="firing"}) > 50
  annotations:
    summary: "Too many alerts firing"
Key Takeaways
- Alert on symptoms users experience, not just technical metrics
- Use appropriate severity levels - not everything is critical
- Include context in alert messages
- Group related alerts to reduce noise
- Test your alerts before production
- Review regularly and tune thresholds
- Link to runbooks for faster resolution
- Route intelligently based on severity and team
- Monitor alert health itself
- Start simple, add complexity only when needed
Good alerting is an art. Start with the basics, measure what happens, and continuously improve. Your future self (and your team) will thank you!