# Awesome-omni-skill dd-monitors

Monitor management - create, update, mute, and alerting best practices.

## Install

Clone the upstream repo:

```shell
git clone https://github.com/diegosouzapw/awesome-omni-skill
```

Claude Code: install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/cli-automation/dd-monitors" ~/.claude/skills/diegosouzapw-awesome-omni-skill-dd-monitors && rm -rf "$T"
```

Manifest: `skills/cli-automation/dd-monitors/SKILL.md`
# Datadog Monitors

Create, manage, and maintain monitors for alerting.

## Prerequisites

Requires the `pup` binary on your `PATH` (or Go, to install it):

```shell
go install github.com/datadog-labs/pup@latest
```

Ensure `~/go/bin` is in `$PATH`.

## Quick Start

```shell
pup auth login
```
## Common Operations

### List Monitors

```shell
pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"
```

### Get Monitor

```shell
pup monitors get <id> --json
```

### Create Monitor

```shell
pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"
```

### Mute/Unmute

```shell
# Mute with duration
pup monitors mute --id 12345 --duration 1h

# Or mute with specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"

# Unmute
pup monitors unmute --id 12345
```
## ⚠️ Monitor Creation Best Practices

### 1. Avoid Alert Fatigue

| Rule | Why |
|---|---|
| No flapping alerts | Use longer windows and recovery thresholds |
| Meaningful thresholds | Based on SLOs, not guesses |
| Actionable alerts | If no action is needed, don't alert |
| Include a runbook | Put remediation steps in the message |
```python
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
```
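The effect of the evaluation window can be sketched with a quick simulation (the sample data is hypothetical, and `alerts_fired` is an illustrative helper, not part of the pup CLI): a short window fires on every noisy spike, while a longer rolling average only fires on sustained load.

```python
def alerts_fired(samples, threshold, window):
    """Count evaluations where the rolling mean over `window` samples exceeds threshold."""
    fired = 0
    for i in range(window, len(samples) + 1):
        if sum(samples[i - window:i]) / window > threshold:
            fired += 1
    return fired

# Noisy CPU series oscillating around ~65%
cpu = [45, 72, 48, 85, 50, 78, 47, 90, 52, 76]

# 1-sample window at a 50% threshold fires on every spike
print(alerts_fired(cpu, 50, 1))  # -> 6

# 5-sample window at an 80% threshold never fires on this transient noise
print(alerts_fired(cpu, 80, 5))  # -> 0
```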
### 2. Use Proper Scoping

```python
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
```
### 3. Set Recovery Thresholds

```python
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
```
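Why recovery thresholds prevent flapping can be shown with a small state-machine sketch (illustrative only; the series and helper are made up): without a separate recovery threshold, a value oscillating around the critical line changes state on every sample.

```python
def state_transitions(samples, critical, recovery):
    """Count OK <-> ALERT transitions for a series evaluated with hysteresis."""
    state, transitions = "ok", 0
    for value in samples:
        if state == "ok" and value > critical:
            state, transitions = "alert", transitions + 1
        elif state == "alert" and value < recovery:
            state, transitions = "ok", transitions + 1
    return transitions

cpu = [82, 79, 81, 78, 83, 77]  # oscillating around the 80% critical line

# recovery == critical: a transition on every sample (flapping)
print(state_transitions(cpu, critical=80, recovery=80))  # -> 6

# critical_recovery at 70: one clean transition, alert stays up until real recovery
print(state_transitions(cpu, critical=80, recovery=70))  # -> 1
```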
### 4. Include Context in Messages

```python
message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""
```
## ⚠️ NEVER Delete Monitors Directly

Use the safe deletion workflow (same as dashboards):

```python
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark a monitor for deletion instead of deleting it."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
```
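A companion purge step can then remove monitors that have stayed marked past a grace period. This is a sketch: `list_monitors`, `delete_monitor`, and the ISO-8601 `modified` field are assumed client methods/fields, not confirmed pup APIs.

```python
from datetime import datetime, timedelta, timezone

def purge_marked_monitors(client, grace_days: int = 30) -> list:
    """Delete monitors that were marked for deletion more than `grace_days` ago."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=grace_days)
    purged = []
    for monitor in client.list_monitors():
        if "[MARKED FOR DELETION]" not in monitor.get("name", ""):
            continue
        # `modified` assumed to be an ISO-8601 timestamp with offset
        modified = datetime.fromisoformat(monitor["modified"])
        if modified < cutoff:
            client.delete_monitor(monitor["id"])
            purged.append(monitor["id"])
    return purged
```

The grace period gives owners time to object before anything is destroyed; only monitors untouched since being marked are eligible.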
## Monitor Types

| Type | Use Case |
|---|---|
| Metric | CPU, memory, custom metrics |
| Query | Complex metric queries |
| Service check | Agent check status |
| Event | Event stream patterns |
| Log | Log pattern matching |
| Composite | Combine multiple monitors |
| APM | APM metrics |
## Audit Monitors

```shell
# Find monitors without owners
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# Find noisy monitors (high alert count)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | .[:10] | .[] | {id, name, status: .overall_state}'
```
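The same owner audit can be done in Python once the JSON is exported (a sketch assuming the monitor objects carry `id`, `name`, and `tags` keys as in the jq filters above):

```python
def monitors_without_owner(monitors: list) -> list:
    """Return {id, name} for monitors that have no team: tag."""
    return [
        {"id": m["id"], "name": m["name"]}
        for m in monitors
        if not any(tag.startswith("team:") for tag in m.get("tags", []))
    ]

monitors = [
    {"id": 1, "name": "api latency", "tags": ["team:platform", "env:prod"]},
    {"id": 2, "name": "orphaned cpu", "tags": ["env:prod"]},
]
print(monitors_without_owner(monitors))  # -> [{'id': 2, 'name': 'orphaned cpu'}]
```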
## Downtime vs Muting

| Use | When |
|---|---|
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring |

```shell
# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"
```
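Building the ISO-8601 start/end pair for a window like the one above can be scripted (a small helper sketch, not part of pup; the `Z`-suffixed form matches the example timestamps):

```python
from datetime import date, datetime, time, timedelta, timezone

def downtime_window(day: date, start_hour: int, duration_hours: int) -> tuple:
    """Return (start, end) UTC timestamps for a maintenance window."""
    start = datetime.combine(day, time(hour=start_hour), tzinfo=timezone.utc)
    end = start + timedelta(hours=duration_hours)
    # datetime.isoformat emits "+00:00"; swap in the "Z" suffix
    to_z = lambda dt: dt.isoformat().replace("+00:00", "Z")
    return to_z(start), to_z(end)

print(downtime_window(date(2024, 1, 15), 2, 4))
# -> ('2024-01-15T02:00:00Z', '2024-01-15T06:00:00Z')
```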
## Failure Handling

| Problem | Fix |
|---|---|
| Alert not firing | Check that the query returns data and thresholds are correct |
| Too many alerts | Increase the evaluation window, add a recovery threshold |
| No data alerts | Check agent connectivity and that the metric exists |
| Auth error | Re-run `pup auth login` |