Prometheus retention settings directly determine how much storage you pay for. By right-sizing retention periods, downsampling older data, and managing the data lifecycle, you can cut storage costs substantially while keeping the history your monitoring actually needs. This guide walks through retention configuration, downsampling, metric filtering, and cost estimation.
Understanding Prometheus Retention
What Is a Retention Policy?
A retention policy determines:
- Data Lifetime: How long metrics are stored
- Storage Costs: How much disk or object storage you pay for
- Query Performance: How much history queries can reach and have to scan
- Compliance: Whether data retention requirements are met
Retention Policy Impact on Costs
Storage Cost Factors:
- Data Volume: Amount of metrics collected
- Retention Period: How long data is kept
- Replication: Storage replication overhead
- Compression: Storage compression efficiency
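Cost Calculation:
Monthly Cost = Daily Ingest (GB/day) × Retention Days × Replication Factor × Storage Cost per GB-month
The first three terms give the steady-state volume held in storage. There is no division by 30 because the storage price is already quoted per GB per month.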
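As a rough sketch of how retention drives this cost, the snippet below computes the steady-state monthly bill for a few retention periods. It assumes the data volume is a daily ingest figure in GB and the price is per GB per month; the function name and the sample numbers are illustrative and do not come from any Prometheus tooling.

```python
def monthly_storage_cost(daily_ingest_gb, retention_days, replication_factor, price_per_gb_month):
    """Steady-state monthly cost: volume held in storage times the monthly price per GB."""
    stored_gb = daily_ingest_gb * retention_days * replication_factor
    return stored_gb * price_per_gb_month


# Illustrative numbers: 100 GB/day, 3x replication, $0.10 per GB-month.
for days in (7, 15, 30):
    cost = monthly_storage_cost(100, days, 3, 0.10)
    print(f"{days:>2} days retention: ${cost:,.2f}/month")
```

Because every term is multiplicative, halving the retention period halves the bill when the other factors stay fixed.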
Why Retention Optimization Matters
Cost Benefits:
- Reduced Storage: Lower storage costs
- Improved Performance: Faster queries
- Better Scalability: Handle more metrics
- Resource Efficiency: Optimal resource usage
Prerequisites
Before optimizing retention, ensure:
- Prometheus Installed: With retention configuration access
- Storage Metrics: Current storage usage data
- Cost Analysis: Understanding of storage costs
- Compliance Requirements: Data retention requirements
- Monitoring Needs: Required historical data period
Step-by-Step: Retention Policy Configuration
Step 1: Analyze Current Storage Usage
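Inspect the current footprint before changing anything:
# Check Prometheus storage size (the Operator names the pod prometheus-<name>-0
# and mounts the TSDB at /prometheus; adjust both for your deployment)
kubectl exec -n monitoring prometheus-prometheus-0 -c prometheus -- du -sh /prometheus
# Check cardinality: top metric names by series count
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'
# List retention settings
kubectl get prometheus prometheus -n monitoring -o yaml | grep -i retention
# Count distinct metric names
curl -s http://prometheus:9090/api/v1/label/__name__/values | jq '.data | length'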
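To turn the TSDB status output into a ranked list, a small script can help. The sketch below works on a trimmed, made-up response body rather than a live server; it assumes the `seriesCountByMetricName` field of the `/api/v1/status/tsdb` response, and the metric names and counts are invented for illustration.

```python
import json

# Trimmed, made-up example of a /api/v1/status/tsdb response body.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "seriesCountByMetricName": [
      {"name": "up", "value": 150},
      {"name": "container_cpu_usage_seconds_total", "value": 42000},
      {"name": "http_request_duration_seconds_bucket", "value": 18000}
    ]
  }
}
"""


def top_metrics(tsdb_status, limit=10):
    """Return (metric name, series count) pairs, largest first."""
    entries = tsdb_status["data"].get("seriesCountByMetricName", [])
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:limit]]


status = json.loads(SAMPLE_RESPONSE)
for name, count in top_metrics(status):
    print(f"{count:>8}  {name}")
```

The metrics at the top of this list are usually the first candidates for filtering or shorter retention.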
Step 2: Configure Retention Period
Set the retention period:
# prometheus-retention.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  # Keep data for 15 days. Prometheus itself defaults to 15d; the Operator
  # defaults to 24h when neither retention nor retentionSize is set.
  retention: 15d
  # Retention size limit (optional); keep it below the volume size
  retentionSize: 50GB
  # Storage configuration
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 100Gi
Retention Period Options:
- 1h, 2h, 6h, 12h: Short-term retention
- 1d, 3d, 7d, 15d: Medium-term retention
- 30d, 60d, 90d: Long-term retention
- 1y: Very long-term retention
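To see what a given retention period means for disk size, the Prometheus documentation gives a rule of thumb: needed disk space is roughly retention time in seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample on average. The sketch below applies it; the ingestion rate of 100,000 samples/s is a hypothetical figure (on a live server you could read the real one from `rate(prometheus_tsdb_head_samples_appended_total[5m])`), and the parser only handles single-unit durations.

```python
UNIT_SECONDS = {"h": 3600, "d": 86400, "w": 7 * 86400, "y": 365 * 86400}


def retention_seconds(retention):
    """Convert a single-unit duration such as '15d' or '12h' to seconds."""
    value, unit = int(retention[:-1]), retention[-1]
    return value * UNIT_SECONDS[unit]


def estimated_disk_bytes(retention, samples_per_second, bytes_per_sample=2.0):
    """Rule-of-thumb sizing: retention time x ingestion rate x bytes per sample."""
    return retention_seconds(retention) * samples_per_second * bytes_per_sample


# Hypothetical ingestion rate of 100,000 samples/s.
for period in ("7d", "15d", "30d"):
    gib = estimated_disk_bytes(period, 100_000) / 2**30
    print(f"{period}: ~{gib:.0f} GiB")
```

Treat the result as a lower bound for the volume size and leave headroom for the WAL and for compaction.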
Step 3: Implement Tiered Retention
Prometheus applies a single retention period to the whole local TSDB, so tiering means keeping local retention short and forwarding selected metrics to longer-term remote storage:
# prometheus-tiered-retention.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  # Default retention
  retention: 7d
  # Remote write for long-term storage
  remoteWrite:
    - url: "https://remote-storage/api/v1/write"
      name: long-term-storage
      queueConfig:
        maxSamplesPerSend: 1000
        batchSendDeadline: 5s
      writeRelabelConfigs:
        # Only send critical metrics to long-term storage.
        # Regexes are fully anchored, so histogram series need their suffixes.
        - sourceLabels: [__name__]
          regex: 'up|kube_pod_status_phase|http_request_duration_seconds_(bucket|sum|count)'
          action: keep
Step 4: Configure Data Lifecycle
Implement data lifecycle management:
# prometheus-lifecycle.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  # Short local retention (applies to all metrics, including high-cardinality ones)
  retention: 3d
  # Compress the write-ahead log
  walCompression: true
  # Storage optimization
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi # Reduced from 100Gi
Advanced Retention Strategies
Strategy 1: Downsampling
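Downsample historical data. Prometheus does not downsample locally, so this example forwards samples to Thanos:
# prometheus-downsampling.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  retention: 7d # Keep raw data for 7 days
  # Remote write to Thanos Receive (the sidecar does not accept remote write).
  # The hostname is a placeholder; 19291 is the default remote-write port.
  remoteWrite:
    - url: "http://thanos-receive:19291/api/v1/receive"
      name: thanos
      writeRelabelConfigs:
        # Keeps every series; narrow the regex to forward less
        - sourceLabels: [__name__]
          regex: '.*'
          action: keep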
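Configure downsampling and per-resolution retention in the Thanos compactor:
# thanos-compactor.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-compactor
spec:
  replicas: 1 # Run a single compactor per bucket
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
        - name: compactor
          image: thanosio/thanos:latest # Pin a specific version in production
          args:
            - compact
            - --wait # Keep running instead of exiting after one pass
            - --data-dir=/var/thanos/compact
            - --objstore.config-file=/etc/thanos/objstore.yaml
            - --retention.resolution-raw=7d # Keep raw for 7 days
            - --retention.resolution-5m=30d # Keep 5m downsampled for 30 days
            - --retention.resolution-1h=90d # Keep 1h downsampled for 90 days
            - --delete-delay=48h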
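To reason about what these retention windows mean for queries, the sketch below maps the age of a data point to the finest resolution still retained. It mirrors the three `--retention.resolution-*` values above and looks only at retention; it deliberately ignores when the compactor actually creates downsampled blocks. The function and constant names are illustrative.

```python
# Retention per resolution, in days, ordered from finest to coarsest.
RETENTION_DAYS = [("raw", 7), ("5m", 30), ("1h", 90)]


def finest_resolution(age_days, retention=RETENTION_DAYS):
    """Finest resolution still retained for data of this age, or None once all have expired."""
    for resolution, days in retention:
        if age_days <= days:
            return resolution
    return None


for age in (1, 10, 45, 120):
    print(f"{age:>3} days old -> {finest_resolution(age)}")
```

A dashboard that looks back 60 days under these settings can therefore only rely on 1-hour resolution for its oldest data.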
Strategy 2: Metric Filtering
Filter high-cardinality metrics out of what is sent to remote storage:
# prometheus-filtering.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  retention: 7d
  # writeRelabelConfigs filter what is sent to remote storage. To drop series
  # before local ingestion, use metricRelabelings on the ServiceMonitor instead.
  remoteWrite:
    - url: "https://remote-storage/api/v1/write"
      writeRelabelConfigs:
        # Drop high-cardinality container metrics
        - sourceLabels: [__name__]
          regex: 'container_.*'
          action: drop
        # Keep only important metrics (this rule alone already excludes container_*)
        - sourceLabels: [__name__]
          regex: 'up|http_request_total|cpu_usage'
          action: keep
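Before rolling out relabel rules, it can help to check which metric names would survive them. The sketch below is a simplified simulation of the drop and keep rules above, applied in order to the metric name only. It uses Python's `re.fullmatch` to imitate the full anchoring Prometheus applies to relabel regexes; Prometheus itself uses RE2 syntax, so more exotic patterns may behave differently. The test metric names are made up.

```python
import re

# (action, regex) pairs mirroring the writeRelabelConfigs above.
RULES = [
    ("drop", r"container_.*"),
    ("keep", r"up|http_request_total|cpu_usage"),
]


def is_forwarded(metric_name, rules=RULES):
    """Apply keep/drop rules in order; a series must survive every rule to be sent."""
    for action, pattern in rules:
        matched = re.fullmatch(pattern, metric_name) is not None
        if action == "drop" and matched:
            return False
        if action == "keep" and not matched:
            return False
    return True


for name in ("up", "cpu_usage", "container_memory_usage_bytes", "http_request_total_created"):
    print(f"{name}: {'forwarded' if is_forwarded(name) else 'dropped'}")
```

Note that `http_request_total_created` is dropped: because the regex is anchored, a keep rule matches only the exact names listed.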
Strategy 3: Selective Retention
Route different metric types to backends with different retention. Each backend enforces its own retention; Prometheus only decides what is sent where:
# prometheus-selective-retention.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  retention: 7d # Default retention
  # Remote write with different retention policies
  remoteWrite:
    # Critical metrics - long retention
    - url: "https://storage-critical/api/v1/write"
      name: critical-metrics
      writeRelabelConfigs:
        - sourceLabels: [__name__]
          regex: 'up|kubernetes_cluster_status'
          action: keep
    # Standard metrics - medium retention
    - url: "https://storage-standard/api/v1/write"
      name: standard-metrics
      writeRelabelConfigs:
        - sourceLabels: [__name__]
          regex: 'http_request.*|cpu_usage.*'
          action: keep
    # Debug metrics - short retention
    - url: "https://storage-debug/api/v1/write"
      name: debug-metrics
      writeRelabelConfigs:
        - sourceLabels: [__name__]
          regex: 'debug_.*|trace_.*'
          action: keep
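With several remote-write targets, each target evaluates its own keep rule independently, so a metric can land in one tier, several, or none. The sketch below checks that routing for a handful of made-up metric names, using the three regexes above and Python's `re.fullmatch` as a stand-in for Prometheus's anchored RE2 matching.

```python
import re

# Tier name -> keep regex, mirroring the three remoteWrite entries above.
TIERS = {
    "critical-metrics": r"up|kubernetes_cluster_status",
    "standard-metrics": r"http_request.*|cpu_usage.*",
    "debug-metrics": r"debug_.*|trace_.*",
}


def tiers_for(metric_name, tiers=TIERS):
    """Every remote-write target whose keep rule matches the metric name."""
    return [tier for tier, pattern in tiers.items() if re.fullmatch(pattern, metric_name)]


for name in ("up", "http_request_total", "debug_gc_pause_seconds", "node_load1"):
    targets = tiers_for(name)
    print(f"{name}: {', '.join(targets) if targets else 'local retention only'}")
```

Metrics that match no tier, such as `node_load1` here, exist only in the local TSDB and disappear after the default 7 days.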
Cost Optimization Best Practices
1. Right-Size Retention
Match retention to needs:
- Operational Metrics: 7-15 days
- Business Metrics: 30-90 days
- Compliance Metrics: As required
- Debug Metrics: 1-3 days
2. Implement Downsampling
Downsample historical data:
- Keep raw data: 7 days
- 5-minute samples: 30 days
- 1-hour samples: 90 days
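- Daily aggregates: 1 year (Thanos downsamples only to 5 minutes and 1 hour, so daily values need another mechanism such as recording rules)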
3. Filter High-Cardinality Metrics
Reduce cardinality:
- Drop unused metrics
- Remove high-cardinality labels
- Aggregate metrics
- Use recording rules
4. Use Remote Storage
Offload to cost-effective storage:
- Use object storage (S3, GCS)
- Compress data
- Use lifecycle policies
- Archive old data
Storage Cost Calculation
Example Calculation
Current Setup:
- Data ingestion: 100GB/day
- Retention: 30 days
- Storage cost: $0.10/GB/month
- Replication: 3x
Cost:
Daily Ingest = 100GB
Stored Volume = 100GB × 30 days = 3TB
With Replication = 3TB × 3 = 9TB
Monthly Cost = 9,000GB × $0.10/GB = $900/month
Optimized Setup:
- Retention: 7 days (raw) + 30 days (downsampled)
- Downsampling: 10:1 reduction (100GB/day becomes 10GB/day)
- Replication: 2x
Cost:
Raw Storage = 100GB × 7 days × 2 = 1.4TB
Downsampled Storage = 10GB × 30 days × 2 = 600GB
Total = 2TB
Monthly Cost = 2,000GB × $0.10/GB = $200/month
Savings = $700/month (78% reduction)
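The same comparison can be scripted so you can plug in your own numbers. This is a minimal sketch that treats each setup as a list of (GB per day, days retained) tiers, using the example's figures and taking 1TB as 1,000GB; the function name is illustrative.

```python
PRICE_PER_GB_MONTH = 0.10  # $ per GB per month, from the example


def setup_cost(tiers, replication, price=PRICE_PER_GB_MONTH):
    """tiers: list of (GB per day, days retained). Returns (stored GB, monthly cost)."""
    stored_gb = sum(gb_per_day * days for gb_per_day, days in tiers) * replication
    return stored_gb, stored_gb * price


current_gb, current_cost = setup_cost([(100, 30)], replication=3)
optimized_gb, optimized_cost = setup_cost([(100, 7), (10, 30)], replication=2)
savings = current_cost - optimized_cost

print(f"Current:   {current_gb:,} GB -> ${current_cost:,.0f}/month")
print(f"Optimized: {optimized_gb:,} GB -> ${optimized_cost:,.0f}/month")
print(f"Savings:   ${savings:,.0f}/month ({savings / current_cost:.0%})")
```

Most of the saving in this example comes from shortening raw retention and lowering the replication factor; the downsampled tier adds back only 600GB.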
Monitoring Retention Costs
Track Storage Usage
Monitor storage usage and estimated cost:
# Prometheus storage usage (bytes in persisted blocks)
prometheus_tsdb_storage_blocks_bytes
# Storage per Prometheus instance
sum(prometheus_tsdb_storage_blocks_bytes) by (instance)
# Storage growth in bytes per second (the metric is a gauge, so use deriv, not rate)
deriv(prometheus_tsdb_storage_blocks_bytes[1h])
# Estimated monthly cost at $0.10/GB/month
sum(prometheus_tsdb_storage_blocks_bytes) / 1e9 * 0.10
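If you want this estimate outside of a dashboard, the Prometheus HTTP API can supply the number. The sketch below parses an instant-query response from `/api/v1/query` and converts the byte count to a monthly cost. The sample response, the `http://prometheus:9090` address, and the $0.10/GB/month price are assumptions for illustration; the fetch helper is included but not called, so the example runs without a server.

```python
import json
import urllib.parse
import urllib.request

QUERY = "sum(prometheus_tsdb_storage_blocks_bytes)"


def fetch_instant_query(base_url, query=QUERY):
    """Run an instant query against the Prometheus HTTP API and return the parsed JSON."""
    url = f"{base_url}/api/v1/query?{urllib.parse.urlencode({'query': query})}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def monthly_cost_from_response(response, price_per_gb_month=0.10):
    """Convert the first sample of an instant-query result (bytes) to a monthly cost."""
    results = response["data"]["result"]
    if not results:
        return 0.0
    _timestamp, value = results[0]["value"]  # the sample value arrives as a string
    return float(value) / 1e9 * price_per_gb_month


# Made-up response for 50 GB of blocks. Against a live server, use
# fetch_instant_query("http://prometheus:9090") instead.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000, "50000000000"]}],
    },
}
print(f"${monthly_cost_from_response(sample):.2f}/month")
```

Running this on a schedule and recording the result gives a simple cost trend without extra tooling.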
Troubleshooting
Issue 1: High Storage Costs
Symptoms: Storage costs are too high.
Solutions:
- Reduce retention period
- Implement downsampling
- Filter high-cardinality metrics
- Use remote storage
Issue 2: Data Not Retained Long Enough
Symptoms: Need data longer than retention allows.
Solutions:
- Use remote storage
- Implement downsampling
- Use tiered retention
- Archive old data
Conclusion
Retention optimization reduces storage costs. This guide covered:
- Configuration: Retention policy setup
- Downsampling: Long-term storage strategies
- Filtering: Metric filtering for cost reduction
- Calculation: Cost calculation and optimization
- Best Practices: Production strategies
Key Takeaways:
- Match retention to actual needs
- Implement downsampling for long-term data
- Filter high-cardinality metrics
- Use remote storage for cost savings
- Monitor storage costs continuously
Next Steps:
- Analyze current storage usage
- Define retention requirements
- Configure retention policies
- Implement downsampling
- Monitor and optimize
With optimized retention policies, you can significantly reduce storage costs while maintaining necessary monitoring capabilities.