Day-2 Checklist

Day-2 Checklist

A practical weekly checklist for maintaining your Kubernetes cluster. Print this out, keep it handy, and use it to ensure nothing falls through the cracks.

Weekly Checklist

Cluster Health

  • Check node status

    kubectl get nodes
    
    • All nodes should be Ready
    • No nodes in NotReady for extended time
  • Review node conditions

    kubectl describe nodes | grep -A 5 "Conditions"
    
    • No MemoryPressure
    • No DiskPressure
    • No PIDPressure
  • Check control plane health

    kubectl get pods -n kube-system
    kubectl get --raw /healthz
    
    • All control plane pods running
    • API server healthy
  • Review pending pods

    kubectl get pods --all-namespaces --field-selector=status.phase=Pending
    
    • Investigate any pending pods
    • Check for resource constraints

Application Health

  • Check for crash looping pods

    kubectl get pods --all-namespaces | grep -E "CrashLoopBackOff|Error"
    
    • Investigate and fix crash loops
    • Review logs for root cause
  • Review pod restarts

    kubectl get pods --all-namespaces | awk '$4>0'
    
    • Investigate pods with frequent restarts
    • Check if restarts are expected
  • Check resource usage

    kubectl top pods --all-namespaces
    kubectl top nodes
    
    • Identify pods using excessive resources
    • Check for throttling/OOMKilled
  • Review application logs

    # Check error logs
    kubectl logs -l app=my-app --tail=100 | grep -i error
    
    • Look for error patterns
    • Check for warnings

Performance Monitoring

  • Review response times

    • Check p95/p99 latency metrics
    • Compare with SLO targets
    • Investigate if above thresholds
  • Check error rates

    # If using Prometheus
    rate(http_requests_total{status=~"5.."}[5m])
    
    • Error rate should be < 1%
    • Investigate spikes
  • Review throughput

    • Check request volume trends
    • Identify peak usage periods
    • Plan capacity accordingly

Capacity Planning

  • Check node utilization

    kubectl top nodes
    kubectl describe nodes | grep -A 2 "Capacity\|Allocatable"
    
    • CPU utilization: Target 70-80%
    • Memory utilization: Target 70-80%
    • Plan scaling if > 90%
  • Review resource requests/limits

    kubectl describe pod <pod-name> | grep -A 4 "Requests\|Limits"
    
    • Compare requests vs actual usage
    • Identify over-provisioned pods
    • Identify under-provisioned pods
  • Check for scheduling issues

    kubectl get events --all-namespaces | grep -i "insufficient"
    
    • Look for "Insufficient cpu" or "Insufficient memory"
    • Plan node scaling if frequent

Security

  • Review RBAC

    kubectl get roles,rolebindings --all-namespaces
    kubectl get clusterroles,clusterrolebindings
    
    • Check for overly permissive roles
    • Review service account permissions
    • Remove unused bindings
  • Check for exposed secrets

    kubectl get secrets --all-namespaces
    
    • Review secret usage
    • Rotate secrets if needed
    • Check for secrets in logs
  • Review network policies

    kubectl get networkpolicies --all-namespaces
    
    • Ensure policies are applied
    • Review and tighten rules
  • Check image vulnerabilities

    • Scan container images
    • Update base images if needed
    • Review security advisories
  • Review pod security

    kubectl get pods --all-namespaces -o jsonpath='{.items[*].spec.securityContext}'
    
    • Ensure pods don't run as root
    • Check security contexts
    • Review privilege escalations

Backups & Disaster Recovery

  • Verify etcd backups

    • Confirm backups are running
    • Test restore procedure
    • Verify backup integrity
  • Review application backups

    • Database backups verified
    • Application data backed up
    • Backup retention policy reviewed
  • Test disaster recovery

    • Document recovery procedures
    • Test restore from backup (quarterly)
    • Update runbooks if needed

Configuration Management

  • Review ConfigMaps and Secrets

    kubectl get configmaps,secrets --all-namespaces
    
    • Remove unused ConfigMaps
    • Remove unused Secrets
    • Review for sensitive data
  • Check for deprecated APIs

    kubectl get --raw /api/v1 | grep -i deprecated
    
    • Review deprecation warnings
    • Plan API version migrations
  • Review resource versions

    • Check Kubernetes version compatibility
    • Plan upgrades if needed

Cost Optimization

  • Review resource waste

    kubectl top pods --all-namespaces
    
    • Identify over-provisioned pods
    • Identify idle pods
    • Right-size resources
  • Check for unused resources

    kubectl get deployments,services,pvc --all-namespaces
    
    • Remove unused deployments
    • Remove unused services
    • Clean up unused PVCs
  • Review node costs

    • Check node utilization
    • Consider spot instances for non-critical
    • Review reserved instance usage

Monitoring & Alerting

  • Review alert history

    • Check alert volume
    • Review false positives
    • Tune alert thresholds if needed
  • Verify monitoring coverage

    • All critical services monitored
    • Key metrics being collected
    • Dashboards up to date
  • Test alerting

    • Verify alerts are firing
    • Check alert routing
    • Update runbooks if needed

Documentation

  • Update runbooks

    • Document new procedures
    • Update outdated information
    • Review incident responses
  • Review architecture docs

    • Update service diagrams
    • Document configuration changes
    • Update contact information
  • Review change log

    • Document recent changes
    • Note any issues encountered
    • Update lessons learned

Monthly Checklist

Capacity Review

  • Review growth trends

    • Analyze resource usage trends
    • Project future capacity needs
    • Plan scaling accordingly
  • Review SLOs/SLAs

    • Compare actual vs targets
    • Adjust targets if needed
    • Review with stakeholders

Security Audit

  • Comprehensive RBAC review

    • Audit all roles and bindings
    • Remove unnecessary permissions
    • Document access patterns
  • Vulnerability scanning

    • Scan all container images
    • Review security advisories
    • Update vulnerable components
  • Review network policies

    • Audit all network rules
    • Tighten policies if possible
    • Document network topology

Cost Analysis

  • Detailed cost breakdown

    • Analyze costs by namespace
    • Analyze costs by service
    • Identify optimization opportunities
  • Review resource allocation

    • Right-size all workloads
    • Remove unused resources
    • Optimize node types

Disaster Recovery

  • Test backup restore

    • Restore from etcd backup
    • Test application restore
    • Document any issues
  • Review DR procedures

    • Update recovery procedures
    • Test failover scenarios
    • Update contact lists

Quarterly Checklist

Upgrade Planning

  • Review Kubernetes versions

    • Check current versions
    • Review upgrade paths
    • Plan upgrade schedule
  • Review dependency versions

    • Update base images
    • Review library versions
    • Plan dependency updates

Architecture Review

  • Review architecture

    • Evaluate current architecture
    • Identify improvement opportunities
    • Plan architectural changes
  • Performance optimization

    • Review performance metrics
    • Identify bottlenecks
    • Plan optimizations

Team Training

  • Review team skills
    • Identify training needs
    • Plan knowledge sharing
    • Update documentation

Quick Commands Reference

# Cluster health
kubectl get nodes
kubectl get pods -n kube-system
kubectl get --raw /healthz

# Application health
kubectl get pods --all-namespaces
kubectl top pods --all-namespaces
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Resource usage
kubectl top nodes
kubectl describe nodes | grep -A 5 "Capacity\|Allocatable"

# Security
kubectl get roles,rolebindings --all-namespaces
kubectl get networkpolicies --all-namespaces

# Troubleshooting
kubectl describe pod <pod-name>
kubectl logs <pod-name>
kubectl get events --field-selector involvedObject.name=<pod-name>

Notes Section

Use this space to document issues, findings, or action items:






Tips

  • Automate what you can: Use tools to automate routine checks
  • Focus on trends: Look for patterns over time, not just current state
  • Document everything: Future you will thank current you
  • Don't skip reviews: Regular reviews prevent small issues becoming big ones
  • Share knowledge: Document findings for the team

Last Review Date: _________________

Reviewed By: _________________

Next Review Date: _________________

Keep this checklist visible and make it part of your routine. Consistent maintenance prevents incidents and reduces stress!