# mc-agent-toolkit: monte-carlo-analyze-root-cause

Clone the repository:

```shell
git clone https://github.com/monte-carlo-data/mc-agent-toolkit
```

Or install just this skill into `~/.claude/skills` in one step:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/monte-carlo-data/mc-agent-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/analyze-root-cause" ~/.claude/skills/monte-carlo-data-mc-agent-toolkit-monte-carlo-analyze-root-cause && rm -rf "$T"
```
`skills/analyze-root-cause/SKILL.md`

# Monte Carlo Root Cause Analysis Skill
This skill helps investigate data incidents — freshness delays, volume anomalies, schema changes, field metric drift, and ETL failures — by guiding the agent through a systematic investigation using Monte Carlo's MCP tools. It combines observability metadata with optional direct data querying to find the root cause.
Reference files live next to this skill file. Use the Read tool (not MCP resources) to access them:
- Investigation playbooks by issue type: `references/<type>-investigation.md`
- Data exploration patterns: `references/data-exploration.md`
- Intake when no incident ID: `references/intake-no-incident.md`
- Common root cause catalog: `references/common-root-causes.md`
## When to activate this skill
Activate when the user:
- Mentions a Monte Carlo alert, incident, or anomaly
- Asks "why is this table stale?" or "why did row count drop?"
- Wants to investigate a data quality issue
- Asks about freshness, volume, or schema problems
- Mentions pipeline failures (Airflow, dbt, Databricks)
- Says things like "debug this alert", "investigate this incident", "root cause analysis"
## When NOT to activate this skill
Do not activate when the user is:
- Creating monitors (use the monitoring-advisor skill)
- Running impact assessments before code changes (use the prevent skill)
- Looking at storage costs (use the storage-cost-analysis skill)
- Exploring pipeline performance without a specific incident (use the performance-diagnosis skill)
## Prerequisites

Required: the Monte Carlo MCP server (`integrations.getmontecarlo.com/mcp`) must be configured and authenticated.
Optional but recommended:
- Database MCP server (Snowflake, BigQuery, Redshift, Databricks) — enables direct SQL queries for deeper data investigation. Without this, the skill can still analyze using MC's metadata tools but cannot profile actual data.
- GitHub MCP server — enables searching for recent PRs that may have caused the issue. Without this, the skill falls back to MC's query change detection.
## MCP Tools Used

### From Monte Carlo MCP server
| Tool | Purpose |
|---|---|
| `get_alerts` | Fetch incident/alert details |
| `search` | Find tables by name or keyword |
|  | Table metadata and fields |
| `get_asset_lineage` | Table-level upstream/downstream lineage |
| `get_field_lineage` | Field-level lineage (trace bad data to source column) |
| `get_table_freshness` | Table update/freshness history |
| `get_table_size_history` | Row count and size history |
|  | Read/write query history |
| `get_query_changes` | Detect SQL text modifications |
|  | Root cause analysis for failed/futile/missed queries |
| `get_etl_issues` | ETL pipeline issues; pass `platform` ("airflow", "dbt", or "databricks") |
|  | Find ETL jobs that write to specific tables |
| `get_github_prs` | Recent GitHub PRs from the account's MC GitHub integration |
|  | Job runtime stats, failure rates, 7-day trends |
| `get_change_timeline` | Unified timeline: query changes + volume + ETL failures |
|  | Current timestamp for relative time ranges |
### Optional external MCP tools
| Tool | Purpose |
|---|---|
| Database MCP (Snowflake, BigQuery, etc.) | Run SQL queries for data profiling |
| GitHub MCP | Search for recent PRs (alternative to MC's `get_github_prs`; useful if the account has no MC GitHub integration) |
## Workflow

### Step 1: Understand the problem (intake)
If the user provides an alert or incident ID:
- Call `get_alerts` with the alert ID to fetch details.
- Identify: affected table(s), issue type (freshness, volume, schema, field metric), and when it started.
- Proceed to Step 2.
If the user describes a problem WITHOUT an incident ID: read `references/intake-no-incident.md` for the full intake flow. In short:

- Ask clarifying questions: what table? what looks wrong? when did it start?
- Search for the table: `search(query="table_name")`
- Search for related alerts: `get_alerts` with a recent time range
- Check table health: `get_table_freshness`, `get_table_size_history`
- Narrow down the issue type and proceed to Step 2.
### Step 2: Map the blast radius
- Call `get_asset_lineage(mcons=[table_mcon], direction="UPSTREAM")`: what feeds this table?
- Call `get_asset_lineage(mcons=[table_mcon], direction="DOWNSTREAM")`: what does this table feed?
- If the issue involves specific fields, call `get_field_lineage` to trace which upstream fields feed the affected columns.
Report to the user: "This table is fed by X upstream sources and feeds Y downstream consumers. Here's what could be impacted."
Ask for direction: Before diving deeper, ask the user what they'd like to investigate first. They may already have a hunch ("I think it's the Airflow job" or "check if someone changed the SQL"). Follow their lead — don't run all investigation paths blindly. If they have no preference, proceed with the most likely path based on the issue type.
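The blast-radius report is just a count over the two lineage results. A minimal sketch, assuming the lineage responses have already been reduced to lists of human-readable table names:

```python
def summarize_blast_radius(upstream_names, downstream_names):
    """Build the user-facing summary from UPSTREAM/DOWNSTREAM lineage results.

    Per the skill's rules, this reports human-readable names only,
    never MCONs or UUIDs.
    """
    return (
        f"This table is fed by {len(upstream_names)} upstream source(s) "
        f"and feeds {len(downstream_names)} downstream consumer(s)."
    )
```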
### Step 3: Investigate based on issue type
Read the appropriate reference file and follow its investigation playbook. Each issue type maps to one of the `references/<type>-investigation.md` files listed above:

- Table not updating on schedule (freshness)
- Unexpected row count changes (volume)
- Columns added, removed, or type-changed (schema)
- Airflow/dbt/Databricks pipeline failures (ETL)
- SQL modifications causing data changes
- Field-level metric drift (null rate, mean, etc.)
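The dispatch reduces to a single path template. The concrete `<type>` slugs are assumptions here; verify them against the filenames actually shipped under `references/`:

```python
def playbook_path(issue_type):
    """Resolve an issue type to its playbook using the
    references/<type>-investigation.md naming pattern stated above.
    The slug values ("freshness", "volume", ...) are assumed.
    """
    return f"references/{issue_type}-investigation.md"
```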
### Step 4: Check for upstream causes
Data issues often originate upstream. Walk the lineage chain:

- For each direct upstream table from Step 2:
  - Check freshness: `get_table_freshness`. Is the upstream table also stale?
  - Check size: `get_table_size_history`. Did the upstream table's volume change?
  - Check ETL status: `get_etl_issues` with the relevant `platform`.
- Use `get_field_lineage` to trace the specific field that has bad data back to its source.
- Check which upstream field values correlate with the anomaly (if a DB connector is available; see Step 5).
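The upstream walk can be sketched as below, where `is_stale` and `volume_changed` are stand-ins for the `get_table_freshness` / `get_table_size_history` checks that the agent would back with real MCP calls:

```python
def walk_upstream(upstream_tables, is_stale, volume_changed):
    """Check each direct upstream table and collect suspects with reasons.

    upstream_tables: names from the Step 2 UPSTREAM lineage call.
    is_stale / volume_changed: callables wrapping the freshness and
    size-history checks (stand-ins for real MCP tool calls).
    """
    suspects = []
    for table in upstream_tables:
        reasons = [
            label
            for label, check in (("stale", is_stale), ("volume change", volume_changed))
            if check(table)
        ]
        if reasons:
            suspects.append((table, reasons))
    return suspects
```

If `suspects` comes back empty, the evidence points at the table's own ETL rather than its inputs, which matches the "follow the evidence" rule below.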
### Step 5: Profile data (if database MCP is available)
If the user has a database MCP server connected (Snowflake, BigQuery, Redshift, Databricks, etc.), read `references/data-exploration.md` for SQL investigation patterns including:
- Sample rows around the incident time
- Null rate and distribution checks
- Value correlation with upstream tables
- Before/after comparisons
If no database MCP is available: Tell the user: "I can't query the warehouse directly — for deeper data investigation, connect a database MCP server. I can still analyze using Monte Carlo's metadata and the tools available." Continue the investigation with MC tools only.
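A before/after null-rate comparison, one of the pattern families named above, can be sketched as a generated query. The `loaded_at` load-timestamp column and the ANSI interval syntax are assumptions to adapt to the actual warehouse schema and dialect:

```python
def null_rate_sql(table, column, incident_ts, window_hours=24):
    """Build a before/after null-rate comparison for one column around
    the incident time. `loaded_at` is an assumed load-timestamp column;
    adjust names and interval syntax for the target warehouse.
    """
    return f"""
SELECT
  CASE WHEN loaded_at < TIMESTAMP '{incident_ts}' THEN 'before' ELSE 'after' END AS period,
  COUNT(*) AS row_count,
  AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
FROM {table}
WHERE loaded_at BETWEEN TIMESTAMP '{incident_ts}' - INTERVAL '{window_hours}' HOUR
                    AND TIMESTAMP '{incident_ts}' + INTERVAL '{window_hours}' HOUR
GROUP BY 1
"""
```

A jump in `null_rate` between the `before` and `after` rows localizes the drift to the incident window.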
### Step 6: Check for code changes
Call `get_github_prs` with a time range around when the issue started to find recent PRs from the account's Monte Carlo GitHub integration. Look for PRs that modified dbt models, SQL files, or pipeline configs affecting the impacted table.
If the account has no GitHub integration (tool returns empty), or the user has a local GitHub MCP server they prefer, use that instead.
Also call `get_query_changes` with the affected table MCONs to detect SQL text modifications, and `get_change_timeline` for a unified view of all changes (query modifications + volume shifts + ETL failures) in one call.
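Correlating changes with the anomaly start is then a window filter over the unified timeline. The event shape here (dicts with a `timestamp` key) is an assumption about the `get_change_timeline` payload:

```python
from datetime import datetime, timedelta

def changes_near_anomaly(events, anomaly_start, lookback_hours=24):
    """Return timeline events (query changes, volume shifts, ETL failures)
    that landed in the window just before the anomaly: the classic
    "X changed at T, anomaly at T+1" pattern.
    """
    window_start = anomaly_start - timedelta(hours=lookback_hours)
    return [e for e in events if window_start <= e["timestamp"] <= anomaly_start]
```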
### Step 7: Synthesize and present
Read `references/common-root-causes.md` to match findings against known patterns. Present:
- Root cause — what happened and when, with evidence from tools
- Evidence chain — which tools confirmed each piece of the story
- Impact — what downstream tables/consumers are affected (from Step 2)
- Recommended fix — specific action to resolve the issue
- Prevention — suggest monitoring to catch this earlier next time
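The five-part summary can be assembled mechanically. A hypothetical formatter, where `evidence` is a list of (claim, tool) pairs so every cited fact stays traceable to the tool that returned it:

```python
def format_findings(root_cause, evidence, impact, fix, prevention):
    """Render the skill's five-part summary as plain text.

    evidence: list of (claim, tool_that_confirmed_it) pairs, keeping the
    'only cite facts returned by tools' rule auditable.
    """
    lines = [f"Root cause: {root_cause}", "Evidence chain:"]
    lines += [f"  - {claim} (via {tool})" for claim, tool in evidence]
    lines += [
        f"Impact: {impact}",
        f"Recommended fix: {fix}",
        f"Prevention: {prevention}",
    ]
    return "\n".join(lines)
```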
## Important rules
- Never fabricate data. Only cite numbers and facts returned by tools. If a tool returned no data, say so.
- Follow the evidence. If upstream lineage shows no issues, the problem is likely in the table's own ETL. Don't chase phantom upstream causes.
- Check the timeline. The most common pattern is: "X changed at time T, and the anomaly started at time T+1." Use `get_change_timeline` for this.
- Be specific about what you can't check. If no DB connector is available, explain what additional investigation would be possible with one.
- Never expose MCONs, UUIDs, or internal identifiers to the user. Use human-readable table names.
- Cross-platform awareness. ETL issues can come from Airflow, dbt, or Databricks. Check all platforms that are relevant.