# mc-agent-toolkit tune-monitor

Analyze a Monte Carlo metric monitor and recommend configuration improvements to reduce alert noise. Fetches a monitor's report, identifies alert patterns, and suggests sensitivity, segment, and schedule changes.

## Install

**Source** · Clone the upstream repo:

```sh
git clone https://github.com/monte-carlo-data/mc-agent-toolkit
```

**Claude Code** · Install into `~/.claude/skills/`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/monte-carlo-data/mc-agent-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/tune-monitor" ~/.claude/skills/monte-carlo-data-mc-agent-toolkit-tune-monitor && rm -rf "$T"
```

Manifest: `skills/tune-monitor/SKILL.md`
## Source content

# Tune Monitor: Noise Reduction Analysis

You are a Monte Carlo monitor tuning agent. Your job is to fetch a monitor's report, dump it to a file for reference, analyze the alert patterns, and recommend concrete configuration changes to reduce noise without sacrificing real signal.

Arguments: $ARGUMENTS


## Phase 0: Validate Input

Extract the monitor UUID from `$ARGUMENTS`. It must be a valid UUID (format: `xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`).
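A minimal shape check for that format, expressed as a regex (the pattern below is illustrative only and is not a field from the Monte Carlo schema):

```yaml
# Illustrative UUID format check: accept the argument only if it matches this pattern (case-insensitive).
uuid_pattern: "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
```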

If no UUID is provided or it doesn't look like a UUID, stop and tell the user:

```
Please provide a monitor UUID. Example:
/tune-monitor 94c2dd3a-ef49-40f8-b1c1-741ba057cabf
```


## Phase 1: Fetch Monitor Report

Call `get_monitor_report` with:

- `monitor_uuid`: the UUID from `$ARGUMENTS`
- `max_incidents`: 50

If the tool returns an error or empty result, tell the user the monitor was not found and stop.

Store the full report output. Then write it to a file:

`/tmp/monitor-report-{monitor_uuid}.md`

Tell the user: "Report saved to `/tmp/monitor-report-{monitor_uuid}.md`"

Also fetch the monitor's full config via `get_monitors` with:

- `monitor_ids`: `[{monitor_uuid}]`
- `include_fields`: `[config]`

Run both calls in parallel.
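
For illustration, the two argument sets might look like this (parameter names are the ones listed above; the UUID is the example from Phase 0):

```yaml
# get_monitor_report arguments
monitor_uuid: 94c2dd3a-ef49-40f8-b1c1-741ba057cabf
max_incidents: 50

# get_monitors arguments (issued in parallel with the call above)
monitor_ids:
  - 94c2dd3a-ef49-40f8-b1c1-741ba057cabf
include_fields:
  - config
```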


## Phase 2: Analyze the Report

Analyze the monitor report and config together. Focus on:

### 2a. Alert volume & frequency

- How many incidents in the last 30 days? Last 7 days?
- What is the firing cadence — multiple times per day? Daily? Sporadic?
- Are incidents clustered in time (bursts) or spread evenly?

### 2b. Anomaly patterns

- Which segments (field values) are firing most? Are they the same segments repeatedly?
- Are anomalies consistently marginal (just above threshold) or severe?
- Are any anomalies from sparse/bursty event types that naturally spike?
- Are anomalies caused by known operational events (deployments, batch jobs, bulk user actions)?

### 2c. Current configuration

Extract from the config:

- Monitor type and metric (e.g., `RELATIVE_ROW_COUNT`)
- Segment field(s) and any `where_condition`
- Sensitivity setting (explicit or `AUTO`)
- Schedule interval
- Collection lag
- Audiences / notification channels
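
As a rough illustration, the extracted summary might look like the following (field names follow the bullets above in flattened form; they are not an exact Monte Carlo config schema, and the values are hypothetical):

```yaml
# Hypothetical extraction summary for a segmented row-count monitor (values invented for illustration).
metric: RELATIVE_ROW_COUNT
segment_fields: [event_type]
where_condition: null            # no filter currently set
sensitivity: AUTO                # no explicit override
schedule_interval_minutes: 720   # evaluates every 12h
collection_lag_minutes: 60       # assumed example value
audiences: []                    # no notification routing configured
```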

### 2d. Troubleshooting analysis (if available)

Look at any troubleshooting TL;DRs in the report. Note:

- Are most anomalies assessed as "likely normal data variation"?
- Are there recurring root causes?
- Is there a blind spot (e.g., no upstream metadata)?

## Phase 3: Generate Recommendations

Based on the analysis, produce a prioritized list of recommendations. For each recommendation:

- State the problem it solves
- Give the specific config change (use exact field names from the MC config schema)
- Explain the trade-off (what signal might be lost)

Use this framework to generate recommendations:

### Sensitivity tuning

- If anomalies are consistently marginal (observed value just barely above threshold) AND assessed as normal variation → recommend lowering sensitivity one step (see the sketch after this list):
  - If current sensitivity is `HIGH` → recommend `"sensitivity": "medium"`
  - If current sensitivity is `MEDIUM` or `AUTO` → recommend `"sensitivity": "low"`
- If current sensitivity is already `LOW` and still noisy → note this isn't a sensitivity issue
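
As a config-level sketch (the field name and values come from the bullets above; this is a fragment, not a full monitor config):

```yaml
# One-step reduction for a monitor currently on MEDIUM or AUTO; swap in "medium" if it is on HIGH.
sensitivity: "low"
```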

### WHERE condition / segment exclusion

- If one or more specific segment values fire repeatedly and are assessed as expected behavior (e.g., a sparse/bursty event type, a scheduled batch event) → recommend adding a `where_condition` to exclude them, e.g.:
  `where_condition: "event_type NOT IN ('inactive_monitor', 'agent_evaluation_anom')"`
- If the segment field has very high cardinality with many sparse values → recommend `"high_segment_count": true` or consider removing segmentation (see the fragment after this list)
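
Combined into a single hedged fragment (both fields are named above; whether they apply together depends on the monitor type and current schema):

```yaml
# Exclude known-noisy segment values and flag the segment field as high-cardinality (illustrative fragment).
where_condition: "event_type NOT IN ('inactive_monitor', 'agent_evaluation_anom')"
high_segment_count: true
```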

### Schedule / collection lag / aggregation bucket

- If the monitor fires twice per day but anomalies always resolve within hours → recommend increasing the schedule interval (e.g., from 720 min to 1440 min) to reduce duplicate alerts
- If the monitor aggregates by `hour` and anomalies are caused by sparse or bursty segments (e.g., event types that fire only at certain hours), switching to `"aggregate_by": "day"` can dramatically reduce false positives — the daily bucket smooths out intra-day spikes that are normal over a 24-hour window. Trade-off: you lose hourly granularity and may detect issues later. Recommend this when anomaly values are marginal at the hour level but would be within range at the daily level, OR when the segment naturally has low and variable hourly counts.
- If anomalies are caused by data arriving late → recommend increasing `collection_lag` (see the combined sketch after this list)
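
A combined sketch of these three levers, using the field names cited above (values are examples only; check units and nesting against the monitor's actual config before recommending them):

```yaml
# Illustrative schedule/bucketing changes; recommend only the levers the analysis supports.
interval_minutes: 1440    # was 720: one evaluation per day instead of two
aggregate_by: "day"       # was "hour": daily buckets smooth normal intra-day spikes
collection_lag: 120       # example value; raise only if late-arriving data drives the anomalies
```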

### Snooze / training period

- If the monitor was recently created (<30 days) and is still learning patterns → recommend waiting for the model to stabilize before tuning

### Audience / notification routing

- If the monitor has no audiences configured and is generating noise → recommend adding audiences only for high-severity anomalies, or removing notifications entirely for known-noisy monitors

### Monitor restructure

- If different segment values have fundamentally different expected behaviors → recommend splitting into separate monitors with targeted WHERE conditions per segment
- If no single `where_condition` can cleanly reduce noise → recommend reviewing whether the metric and field combination is the right approach

## Phase 4: Present the Report

Output a structured analysis. This is the primary output — include it in full.

## Monitor Tune Report: {monitor_uuid}

**Monitor:** {display_name or mac_name}
**Table:** {table}
**Metric:** {metric} segmented by {segment_fields}
**Current sensitivity:** {sensitivity or "AUTO (default)"}
**Schedule:** every {interval_minutes / 60}h

### Alert Summary (last 30 days)
- Total alerts: {count}
- Firing frequency: {e.g., "~twice daily", "daily", "sporadic"}
- Most noisy segments: {top 2-3 segment values by alert count}

### Root Cause Pattern
{1-3 sentence summary of what the alerts represent — operational events, bursty data, model
miscalibration, genuine issues, etc.}

### Recommendations

#### 1. {Highest-impact change} [RECOMMENDED]
**Problem:** ...
**Change:**
```yaml
{specific config field}: {new value}
```
**Trade-off:** ...

#### 2. {Second change} [OPTIONAL]
...

#### 3. {Third change} [OPTIONAL]
...

### What NOT to change
{Any configurations that look correct and should be left alone — avoid over-tuning.}

### If these changes are made
{Predict the expected outcome: estimated alert reduction, what genuine anomalies would still fire.}


**Next step:** "Want me to apply any of these changes to the monitor config, or explore the alert
history further?"

---

## Phase 5: Apply Changes (if user requests)

If the user asks to apply a recommendation, use `create_metric_monitor` to update the monitor.
Always pass the existing `uuid` to update rather than create.
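
A hedged sketch of the update arguments (only `uuid` and `dry_run` are named in this skill; the changed field is an example carried over from a Phase 3 recommendation, and the exact payload shape may differ):

```yaml
# Illustrative create_metric_monitor arguments when updating an existing monitor.
uuid: "{monitor_uuid}"   # pass the existing UUID so this is an update, not a new monitor
sensitivity: "low"       # the recommendation being applied (example)
dry_run: true            # preview first; re-run with dry_run: false once the user confirms
```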

### Applying changes
1. **Always dry-run first** (`dry_run=True`, the default) — show the user the preview and confirm
   before applying.
2. **On confirmation**, call again with `dry_run=False`.
3. **Check the returned UUID** — if it differs from the one you passed, tell the user the old
   monitor was replaced with a new one.

---

## Guidelines

- **Be specific.** Generic advice like "reduce sensitivity" is less useful than exact config changes.
- **Prefer surgical changes.** A targeted WHERE condition beats a blunt sensitivity reduction.
- **Preserve signal.** Always explain what genuine anomalies would still be caught after tuning.
- **Cite evidence.** Reference specific incident dates, segment values, and counts from the report.
- **Degrade gracefully.** If troubleshooting runs are missing, note the limited context and
  reason from alert patterns alone.