# Plan Capacity

From `agent-almanac` (source: `i18n/caveman/skills/plan-capacity/SKILL.md`). To install the skill:

```bash
git clone https://github.com/pjt222/agent-almanac
```

Or copy only this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/i18n/caveman/skills/plan-capacity" ~/.claude/skills/pjt222-agent-almanac-plan-capacity-885814 \
  && rm -rf "$T"
```
Forecast resource needs and prevent saturation through data-driven capacity planning.
## When to Use
- Before seasonal traffic spikes (holidays, sales events)
- When planning new feature launches
- During quarterly capacity reviews
- When resource utilization trends upward
- Before budget planning cycles
## Inputs
- Required: Historical metrics (CPU, memory, disk, network, requests/sec)
- Required: Time range for trend analysis (minimum 4 weeks)
- Optional: Business growth projections (expected user growth, feature launches)
- Optional: Budget constraints
## Procedure
### Step 1: Collect Historical Metrics
Query Prometheus for key resource metrics:
```promql
# CPU usage trend over 8 weeks
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)

# Memory usage trend
avg(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) by (instance)

# Disk usage growth
avg(node_filesystem_size_bytes - node_filesystem_free_bytes) by (instance, device)

# Request rate growth
sum(rate(http_requests_total[5m])) by (service)

# Database connection pool usage
avg(db_connection_pool_used / db_connection_pool_max) by (instance)
```
Export to analyze:
```bash
# Export 8 weeks of CPU data
curl -G 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)' \
  --data-urlencode 'start=2024-12-15T00:00:00Z' \
  --data-urlencode 'end=2025-02-09T00:00:00Z' \
  --data-urlencode 'step=1h' | jq '.data.result' > cpu_8weeks.json
```
Expected: Clean time series data for each resource with no large gaps.
On failure: Missing data reduces forecast accuracy. Check metric retention and scrape intervals.
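A small script can sanity-check the export for holes before you forecast on it. A minimal sketch, assuming the `query_range` export shape from the command above (each series carries `values` as `[timestamp, "value"]` pairs); the `max_step` of one hour matches the export's `step=1h`:

```python
def find_gaps(values, max_step=3600):
    """Return (start, end) pairs where consecutive samples are farther
    apart than max_step seconds, i.e. holes in the time series."""
    gaps = []
    for (t0, _), (t1, _) in zip(values, values[1:]):
        if float(t1) - float(t0) > max_step:
            gaps.append((float(t0), float(t1)))
    return gaps

# Illustrative data in the exported shape: one series with a
# three-hour hole between the second and third samples
result = [{
    "metric": {"instance": "node-1"},
    "values": [[1700000000, "0.41"], [1700003600, "0.43"], [1700014400, "0.47"]],
}]

for series in result:
    gaps = find_gaps(series["values"])
    if gaps:
        print(series["metric"]["instance"], "has", len(gaps), "gap(s)")
```

In practice you would load `cpu_8weeks.json` with `json.load` and run the same check over every series before trusting the trend.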
### Step 2: Calculate Growth Rates with `predict_linear`
Use Prometheus's `predict_linear()` to forecast saturation:
```promql
# Predict when CPU will hit 80% (4 weeks ahead)
predict_linear(
  avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))[8w:],
  4*7*24*3600  # 4 weeks in seconds
) > 0.80

# Predict disk full date (8 weeks ahead)
predict_linear(
  avg(node_filesystem_size_bytes - node_filesystem_free_bytes)[8w:],
  8*7*24*3600
) > 0.95 * avg(node_filesystem_size_bytes)

# Predict memory pressure (2 weeks ahead)
predict_linear(
  avg(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[8w:],
  2*7*24*3600
) / avg(node_memory_MemTotal_bytes) > 0.90

# Predict request rate capacity breach (4 weeks ahead)
predict_linear(
  sum(rate(http_requests_total[5m]))[8w:],
  4*7*24*3600
) > 10000  # known capacity limit
```
Create a forecasting dashboard:
```json
{
  "dashboard": {
    "title": "Capacity Forecast",
    "panels": [
      {
        "title": "CPU Saturation Forecast (4 weeks)",
        "targets": [
          {
            "expr": "predict_linear(avg(rate(node_cpu_seconds_total{mode!=\"idle\"}[5m]))[8w:], 4*7*24*3600)",
            "legendFormat": "Predicted CPU"
          },
          {
            "expr": "0.80",
            "legendFormat": "Target Threshold (80%)"
          }
        ]
      },
      {
        "title": "Disk Full Date",
        "targets": [
          {
            "expr": "(avg(node_filesystem_size_bytes) - predict_linear(avg(node_filesystem_free_bytes)[8w:], 8*7*24*3600)) / avg(node_filesystem_size_bytes)",
            "legendFormat": "Predicted Usage %"
          }
        ]
      }
    ]
  }
}
```
Expected: Clear visualization showing when resources will breach thresholds.
On failure: If predictions look wrong (negative values, wild swings), check for:
- Insufficient history (need minimum 4 weeks)
- Step spikes (deployments, migrations) distorting trend
- Seasonal patterns not captured by linear model
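When a forecast looks wrong, it can help to reproduce the fit offline on the exported samples. `predict_linear()` is an ordinary least-squares line extrapolated past the end of the range; a rough pure-Python equivalent (sketch only, operating on `(timestamp, value)` pairs for one series):

```python
def predict_linear(samples, horizon_s):
    """Least-squares linear fit over (t, v) samples, extrapolated
    horizon_s seconds past the last sample -- the same linear model
    Prometheus's predict_linear() uses."""
    n = len(samples)
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    slope = sum((t - t_mean) * (v - v_mean) for t, v in samples) / \
            sum((t - t_mean) ** 2 for t in ts)
    intercept = v_mean - slope * t_mean
    return slope * (ts[-1] + horizon_s) + intercept

# Illustrative series: CPU fraction growing 1 point per week
week = 7 * 24 * 3600
samples = [(i * week, 0.45 + 0.01 * i) for i in range(4)]
print(round(predict_linear(samples, 4 * week), 2))  # 0.52: below the 0.80 threshold
```

If the offline fit and Prometheus disagree badly, suspect gaps or step changes in the range rather than the function itself.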
### Step 3: Calculate Current Headroom
Determine safety margin before saturation:
```promql
# CPU headroom (percentage remaining before 80% threshold)
(0.80 - avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))) / 0.80 * 100

# Memory headroom (bytes remaining before 90% usage)
avg(node_memory_MemAvailable_bytes) - (avg(node_memory_MemTotal_bytes) * 0.10)

# Request rate headroom (requests/sec before saturation)
10000 - sum(rate(http_requests_total[5m]))

# Time until saturation (weeks until CPU hits 80%)
(0.80 - avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])))
  / deriv(avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))[8w:])
  / (7*24*3600)
```
Create a headroom summary report:
```bash
cat > capacity_headroom.md <<'EOF'
# Capacity Headroom Report (2025-02-09)

## Current Utilization
- **CPU**: 45% average (target: <80%)
- **Memory**: 62% (target: <90%)
- **Disk**: 71% (target: <95%)
- **Request Rate**: 4,200 req/s (capacity: 10,000)

## Headroom Analysis
- **CPU**: 35% headroom → ~12 weeks until saturation
- **Memory**: 28% headroom → ~16 weeks until saturation
- **Disk**: 24% headroom → ~8 weeks until full
- **Request Rate**: 5,800 req/s headroom → ~20 weeks until capacity

## Priority Actions
1. **Disk**: Implement log rotation or expand volume within 4 weeks
2. **CPU**: Plan horizontal scaling in next quarter
3. **Memory**: Monitor but no immediate action needed
EOF
```
Expected: Quantified headroom for each resource with time-to-saturation estimates.
On failure: If headroom is already negative, you're in reactive mode. Immediate scaling needed.
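The time-until-saturation arithmetic above reduces to headroom divided by growth rate. A small helper makes the edge cases explicit (the numbers below are the report's illustrative figures, not measurements):

```python
def weeks_to_saturation(current, threshold, weekly_delta):
    """Weeks until `current` utilization reaches `threshold`,
    assuming it keeps growing by `weekly_delta` per week.
    Returns 0 when headroom is already gone (reactive mode)."""
    headroom = threshold - current
    if headroom <= 0:
        return 0.0
    if weekly_delta <= 0:
        return float("inf")  # flat or shrinking: no saturation date
    return headroom / weekly_delta

# CPU at 45%, 80% target, growing ~2.9 points/week -> ~12 weeks
print(round(weeks_to_saturation(0.45, 0.80, 0.029), 1))  # 12.1
```

The zero return value is the "reactive mode" case called out above: any scaling decision is already overdue.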
### Step 4: Model Growth Scenarios
Factor in business projections:
```python
# Example Python script for scenario modeling
import pandas as pd

# Load historical data (flattened from the Step 1 export)
df = pd.read_json('cpu_8weeks.json')

# Weekly growth rate: samples are hourly (step=1h), so one week is
# 7*24 rows; adjust `periods` if the export step changes
growth_rate_weekly = df['value'].pct_change(periods=7 * 24).mean()

weeks_ahead = 12

# Scenario 1: current trend
current_trend = df['value'].iloc[-1] * (1 + growth_rate_weekly) ** weeks_ahead

# Scenario 2: 2x user growth (marketing campaign)
accelerated_trend = df['value'].iloc[-1] * (1 + growth_rate_weekly * 2) ** weeks_ahead

# Scenario 3: new feature launch (+30% baseline)
feature_launch = (df['value'].iloc[-1] * 1.30) * (1 + growth_rate_weekly) ** weeks_ahead

print(f"Current Trend (12 weeks): {current_trend:.1%} CPU")
print(f"2x Growth Scenario: {accelerated_trend:.1%} CPU")
print(f"Feature Launch Scenario: {feature_launch:.1%} CPU")
print("Threshold: 80%")
```
Expected: Multiple scenarios showing impact of business changes on capacity.
On failure: If scenarios exceed capacity, prioritize scaling before the event.
### Step 5: Generate Scaling Recommendations
Create actionable recommendations:
```markdown
## Capacity Scaling Plan

### Immediate Actions (Next 4 Weeks)

1. **Disk Expansion** [Priority: HIGH]
   - Current: 500GB, 71% used
   - Projected full date: 2025-04-01 (8 weeks)
   - Action: Expand to 1TB by 2025-03-15
   - Cost: $50/month additional
   - Justification: 5 weeks lead time needed

2. **Log Rotation Policy** [Priority: MEDIUM]
   - Current: Logs retained 90 days
   - Action: Reduce to 30 days, archive to S3
   - Savings: ~150GB disk space
   - Cost: $5/month S3 storage

### Near-Term Actions (Next Quarter)

3. **Horizontal Scaling - API Tier** [Priority: MEDIUM]
   - Current: 4 instances, 45% CPU
   - Projected: 65% CPU by 2025-05-01
   - Action: Add 2 instances (to 6 total)
   - Cost: $400/month
   - Trigger: When CPU avg exceeds 60% for 7 days

4. **Database Connection Pool** [Priority: LOW]
   - Current: 50 max connections, 40% used
   - Projected: 55% by Q3
   - Action: Increase to 75 in Q2
   - Cost: None (configuration change)

### Long-Term Planning (Next 6 Months)

5. **Migration to Auto-Scaling** [Priority: MEDIUM]
   - Current: Manual scaling
   - Action: Implement Kubernetes HPA (Horizontal Pod Autoscaler)
   - Timeline: Q3 2025
   - Benefit: Automatic response to load spikes
```
Expected: Prioritized list with costs, timelines, and trigger conditions.
On failure: If recommendations are rejected due to cost, revisit thresholds or accept risk.
### Step 6: Set Up Capacity Alerts
Create alerts for low headroom:
```yaml
# capacity_alerts.yml
groups:
  - name: capacity
    interval: 1h
    rules:
      - alert: CPUCapacityLow
        expr: |
          (0.80 - avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))) / 0.80 < 0.20
        for: 24h
        labels:
          severity: warning
        annotations:
          summary: "CPU headroom below 20%"
          description: "Current CPU headroom: {{ $value | humanizePercentage }}. Scaling needed within 4 weeks."
      - alert: DiskFillForecast
        expr: |
          predict_linear(avg(node_filesystem_free_bytes)[8w:], 4*7*24*3600) < 0.10 * avg(node_filesystem_size_bytes)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk projected to fill within 4 weeks"
          description: "Expand disk volume soon."
      - alert: MemoryCapacityLow
        expr: |
          avg(node_memory_MemAvailable_bytes) < 0.15 * avg(node_memory_MemTotal_bytes)
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Memory headroom below 15%"
```
Expected: Alerts fire before saturation, giving time to scale proactively.
On failure: Tune thresholds if alerts fire too often (alert fatigue) or too late (reactive scrambling).
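When tuning thresholds, it can help to replay the alert arithmetic against hypothetical utilization values before editing the rule file. A sketch mirroring the `CPUCapacityLow` expression above (the sample utilizations are made up):

```python
def cpu_capacity_low(cpu_util, threshold=0.80, min_headroom=0.20):
    """Mirror of the CPUCapacityLow expression: fires when the
    fraction of the threshold still free drops below min_headroom."""
    return (threshold - cpu_util) / threshold < min_headroom

print(cpu_capacity_low(0.45))  # False: ~44% of the threshold still free
print(cpu_capacity_low(0.70))  # True: only 12.5% headroom left
```

Sweeping `min_headroom` over recent utilization history shows how often the alert would have fired, which is a quick check for alert fatigue before deploying a change.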
## Validation
- Historical metrics cover at least 8 weeks
- `predict_linear()` queries return sensible forecasts (no negative values)
- Headroom calculated for all critical resources
- Growth scenarios include business projections
- Scaling recommendations have costs and timelines
- Capacity alerts configured and tested
- Report reviewed with engineering leadership and finance
## Common Pitfalls
- Insufficient history: Linear predictions need 4+ weeks of data; with less, forecasts are unreliable.
- Ignoring step changes: Deployments, migrations, or feature launches create spikes that distort trends. Filter or annotate.
- Linear assumption: Not all growth is linear. Exponential growth (viral products) needs different models.
- Forgetting lead time: Cloud provisioning is fast, but procurement, budgets, and migrations take weeks. Plan early.
- No budget alignment: Capacity planning without budget buy-in leads to last-minute scrambles. Involve finance early.
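For the exponential case, one lightweight alternative to `predict_linear` is a log-linear fit: regress `log(value)` on time, then exponentiate the extrapolation. A sketch with illustrative numbers, not drawn from the document's metrics:

```python
import math

def fit_exponential(samples):
    """Fit v = a * exp(b*t) by least squares on log(v) -- a log-linear
    fit, suitable when growth is clearly multiplicative."""
    n = len(samples)
    ts = [t for t, _ in samples]
    logs = [math.log(v) for _, v in samples]
    t_mean = sum(ts) / n
    l_mean = sum(logs) / n
    b = sum((t - t_mean) * (l - l_mean) for t, l in zip(ts, logs)) / \
        sum((t - t_mean) ** 2 for t in ts)
    a = math.exp(l_mean - b * t_mean)
    return a, b

# Request rate doubling every week: a straight line through these
# points would badly underestimate the future
samples = [(w, 1000 * 2 ** w) for w in range(5)]  # weeks 0..4
a, b = fit_exponential(samples)
print(round(a * math.exp(b * 8)))  # week-8 forecast: 256000
```

Comparing the residuals of the linear and log-linear fits on the same history is a quick way to decide which model the data actually follows.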
## Related Skills
- `setup-prometheus-monitoring`: collect the metrics used for capacity planning
- `build-grafana-dashboards`: visualize forecasts and headroom
- `optimize-cloud-costs`: balance capacity planning with cost optimization