# Awesome-omni-skill · dd-monitors

Monitor management: create, update, mute, and alerting best practices.

## Install

Clone the upstream repo:

```sh
git clone https://github.com/diegosouzapw/awesome-omni-skill
```

Claude Code · install into `~/.claude/skills/`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/cli-automation/dd-monitors" ~/.claude/skills/diegosouzapw-awesome-omni-skill-dd-monitors && rm -rf "$T"
```

Manifest: `skills/cli-automation/dd-monitors/SKILL.md`

---

# Datadog Monitors

Create, manage, and maintain monitors for alerting.

## Prerequisites

Requires the `pup` binary on your `$PATH` (or Go, to install it):

```sh
go install github.com/datadog-labs/pup@latest
```

Ensure `~/go/bin` is in `$PATH`.

## Quick Start

```sh
pup auth login
```

## Common Operations

### List Monitors

```sh
pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"
```

### Get Monitor

```sh
pup monitors get <id> --json
```

### Create Monitor

```sh
pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"
```

### Mute/Unmute

```sh
# Mute with a duration
pup monitors mute --id 12345 --duration 1h

# Or mute with a specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"

# Unmute
pup monitors unmute --id 12345
```

## ⚠️ Monitor Creation Best Practices

### 1. Avoid Alert Fatigue

| Rule | Why |
| --- | --- |
| No flapping alerts | Use `last_Xm`, not `last_1m` |
| Meaningful thresholds | Based on SLOs, not guesses |
| Actionable alerts | If no action is needed, don't alert |
| Include a runbook | Link `@runbook-url` in the message |

```python
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
```

### 2. Use Proper Scoping

```python
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
```

### 3. Set Recovery Thresholds

```python
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50,
        }
    },
}
```
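The dict above is a fragment. As a hedged sketch of how such a definition could be submitted directly to the Datadog v1 monitors API with `requests` (the `create_monitor` helper and the `DD_API_KEY`/`DD_APP_KEY` environment variables are illustrative, not part of `pup`):

```python
import os

import requests

# Hypothetical helper: POST a monitor definition to the Datadog v1 API.
# Assumes DD_API_KEY / DD_APP_KEY are set in the environment.
def create_monitor(definition: dict) -> dict:
    resp = requests.post(
        "https://api.datadoghq.com/api/v1/monitor",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        json=definition,
    )
    resp.raise_for_status()
    return resp.json()

created = create_monitor({
    "name": "High CPU on web servers",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "message": "CPU above 80% @slack-ops",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,
            "warning": 60,
            "warning_recovery": 50,
        }
    },
})
print(created["id"])
```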

### 4. Include Context in Messages

```python
message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""
```

## ⚠️ NEVER Delete Monitors Directly

Use the safe-deletion workflow (same as for dashboards):

```python
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark the monitor instead of deleting it."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")

    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False

    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
```
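The `client` above is left abstract. A minimal sketch of one possible implementation against the Datadog v1 monitors API (the `DatadogClient` class name and environment-variable names are assumptions for illustration):

```python
import os

import requests

class DatadogClient:
    """Hypothetical minimal client for the safe-deletion workflow above."""

    BASE = "https://api.datadoghq.com/api/v1/monitor"

    def __init__(self):
        # Assumes DD_API_KEY / DD_APP_KEY are set in the environment.
        self.headers = {
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        }

    def get_monitor(self, monitor_id: str) -> dict:
        resp = requests.get(f"{self.BASE}/{monitor_id}", headers=self.headers)
        resp.raise_for_status()
        return resp.json()

    def update_monitor(self, monitor_id: str, fields: dict) -> dict:
        # Edit the monitor, sending only the fields to change.
        resp = requests.put(f"{self.BASE}/{monitor_id}", headers=self.headers, json=fields)
        resp.raise_for_status()
        return resp.json()

safe_mark_monitor_for_deletion("12345", DatadogClient())
```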

## Monitor Types

| Type | Use Case |
| --- | --- |
| `metric alert` | CPU, memory, custom metrics |
| `query alert` | Complex metric queries |
| `service check` | Agent check status |
| `event alert` | Event stream patterns |
| `log alert` | Log pattern matching |
| `composite` | Combine multiple monitors |
| `apm` | APM metrics |
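For instance, a `composite` monitor's query combines other monitors by ID with boolean operators. A sketch of such a definition (the IDs and name are placeholders; create it the same way as any other monitor):

```python
# Hypothetical composite monitor: alert only when BOTH child monitors
# (referenced by their monitor IDs) are alerting at the same time.
composite_monitor = {
    "name": "API degraded: high CPU AND high latency",
    "type": "composite",
    "query": "12345 && 67890",  # placeholder child-monitor IDs
    "message": "Both CPU and latency monitors are alerting @slack-ops",
}
```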

## Audit Monitors

```sh
# Find monitors without owners (no team tag)
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# Find noisy monitors (most recent state changes first)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'
```
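Audits can also target the best practices above. A hedged Python sketch that flags monitors missing recovery thresholds (see "Set Recovery Thresholds"); the env-var names are assumptions, the endpoint is Datadog's public v1 monitor list:

```python
import os

import requests

# Hypothetical audit: flag monitors that define a critical threshold but
# no critical_recovery. Assumes DD_API_KEY / DD_APP_KEY are set.
resp = requests.get(
    "https://api.datadoghq.com/api/v1/monitor",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
)
resp.raise_for_status()

for monitor in resp.json():
    thresholds = (monitor.get("options") or {}).get("thresholds") or {}
    if "critical" in thresholds and "critical_recovery" not in thresholds:
        print(f"{monitor['id']}: {monitor['name']} lacks critical_recovery")
```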

## Downtime vs Muting

| Use | When |
| --- | --- |
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring |

```sh
# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"
```

## Failure Handling

| Problem | Fix |
| --- | --- |
| Alert not firing | Check that the query returns data; verify thresholds |
| Too many alerts | Increase the window; add recovery thresholds |
| "No data" alerts | Check agent connectivity; confirm the metric exists |
| Auth error | `pup auth refresh` |
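To confirm that a monitor's query actually returns data (first row above), you can run the underlying metric query against the Datadog v1 metrics query endpoint. A hedged sketch; the query string is an example and the env-var names are assumptions:

```python
import os
import time

import requests

# Hypothetical check: does the metric behind a monitor return any points
# over the last 5 minutes? Assumes DD_API_KEY / DD_APP_KEY are set.
now = int(time.time())
resp = requests.get(
    "https://api.datadoghq.com/api/v1/query",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    params={
        "from": now - 300,
        "to": now,
        "query": "avg:system.cpu.user{env:prod}",
    },
)
resp.raise_for_status()
series = resp.json().get("series", [])
print("data points found" if series else "no data: fix the query or the agent")
```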
