# mc-agent-toolkit: monte-carlo-performance-diagnosis

Clone the repository:

```shell
git clone https://github.com/monte-carlo-data/mc-agent-toolkit
```

Or install just this skill into `~/.claude/skills`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/monte-carlo-data/mc-agent-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/performance-diagnosis" ~/.claude/skills/monte-carlo-data-mc-agent-toolkit-monte-carlo-performance-diagnosis && rm -rf "$T"
```
## Monte Carlo Performance Diagnosis Skill (`skills/performance-diagnosis/SKILL.md`)
This skill helps diagnose data pipeline performance issues using Monte Carlo's cross-platform observability data. It works across Airflow, dbt, Databricks, and warehouse query engines to find bottlenecks, detect regressions, and identify root causes.
Reference files live next to this skill file. Use the Read tool (not MCP resources) to access them:

- Tiered investigation approach: `references/investigation-tiers.md` (relative to this file)
- Query analysis patterns: `references/query-analysis.md` (relative to this file)
## When to activate this skill
Activate when the user:
- Asks about slow pipelines, jobs, or queries
- Wants to find expensive or costly queries
- Mentions performance regressions or degradation
- Asks "why is this pipeline slow?" or "what's using the most compute?"
- Wants to compare performance over time or find bottleneck tasks
- Asks about failed or futile query patterns
## When NOT to activate this skill
Do not activate when the user is:
- Investigating data quality issues (use the prevent skill)
- Looking at storage costs (use the storage-cost-analysis skill)
- Creating monitors (use the monitoring-advisor skill)
- Just querying data or exploring table contents
## Prerequisites
The following MCP tools must be available (connect to Monte Carlo's MCP server):
Discovery tools (Tier 1):

- `get_jobs_performance` -- find slow/failing jobs across Airflow, dbt, Databricks
- `get_top_slow_queries` -- find slowest query groups by total runtime

Bridge tool:

- `get_tables_for_job` -- convert job MCONs to table MCONs

Diagnosis tools (Tier 2):

- `get_tasks_performance` -- drill into a job's individual tasks
- `get_change_timeline` -- unified timeline of query changes, volume shifts, Airflow/dbt failures
- `get_query_rca` -- root cause analysis for failed/futile queries
- `get_query_latency_distribution` -- latency trend over time
- `get_asset_lineage` -- trace upstream/downstream impact

Supporting tools:

- `get_warehouses` -- list available warehouses
## Workflow
### Step 1: Identify the scope
Determine what the user wants to investigate:
- Specific job/pipeline: User mentions a job name or pipeline
- Specific table: User mentions a table that's slow to update
- General discovery: User wants to find what's slow
Call `get_warehouses` to list available warehouses, then match the user's context to a warehouse.
### Step 2: Tier 1 -- Discovery
If you don't have specific MCONs to investigate, start with discovery:
- Find slow jobs: Call `get_jobs_performance` with an optional `integration_type` filter (AIRFLOW, DATABRICKS, DBT) if the user specifies a platform.
  - Results include: job name, average duration, trend (7-day), run count, failure rate
  - Look for: high `avgDuration`, negative `runDurationTrend7d`, high failure rates
- Find expensive queries: Call `get_top_slow_queries` with optional `warehouse_id` and `query_type` ("read" for SELECTs, "write" for INSERT/CREATE/MERGE).
  - Results include: query hash, total runtime, average runtime, run count
  - Look for: queries with high total runtime or high individual execution time
Present the top findings to the user before drilling deeper. A typical investigation needs only 3-7 tool calls.
If both discovery tools return no results: Tell the user no performance issues were found in the current time window. Suggest broadening the scope (different warehouse, longer time range, or a different platform filter).
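The Tier 1 triage above can be sketched as a filter over job rows. The field names (`avgDuration`, `runDurationTrend7d`, `failureRate`) mirror the signals this document names, but the exact keys returned by `get_jobs_performance` -- and the thresholds -- are assumptions:

```python
def triage_jobs(jobs: list[dict],
                max_avg_s: float = 300.0,
                max_failure_rate: float = 0.05) -> list[dict]:
    """Flag jobs showing any of the three Tier 1 signals: high average
    duration, a negative 7-day trend, or a high failure rate.
    Thresholds are illustrative, not part of the toolkit."""
    flagged = []
    for job in jobs:
        reasons = []
        if job["avgDuration"] > max_avg_s:
            reasons.append("high avgDuration")
        if job["runDurationTrend7d"] < 0:  # negative trend, per the doc
            reasons.append("worsening 7-day trend")
        if job["failureRate"] > max_failure_rate:
            reasons.append("high failure rate")
        if reasons:
            flagged.append({**job, "reasons": reasons})
    return flagged

# Hypothetical get_jobs_performance rows -- real keys may differ.
jobs = [
    {"name": "daily_revenue", "avgDuration": 1250.0,
     "runDurationTrend7d": -0.4, "failureRate": 0.01},
    {"name": "hourly_sync", "avgDuration": 45.0,
     "runDurationTrend7d": 0.1, "failureRate": 0.0},
]
problems = triage_jobs(jobs)
```

Present the flagged jobs (with their reasons) to the user before moving to the bridge step.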
### Step 3: Bridge -- Job to Tables
After Tier 1 identifies problematic jobs, convert to table MCONs:
Call `get_tables_for_job(job_mcon=..., integration_type=...)` using the `integration_type` from the job performance results.
This gives you the table MCONs needed for Tier 2 investigation.
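Mechanically, the bridge step reduces to collecting table MCONs from the response; the payload shape below is an assumption for illustration:

```python
def tables_from_bridge(response: dict) -> list[str]:
    """Collect table MCONs from a hypothetical get_tables_for_job response."""
    return [t["mcon"] for t in response.get("tables", [])]

# Assumed response shape -- the real payload may nest these differently.
response = {"tables": [{"mcon": "MCON++abc++table1"},
                       {"mcon": "MCON++abc++table2"}]}
table_mcons = tables_from_bridge(response)
```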
### Step 4: Tier 2 -- Diagnosis
Now drill into root causes using the MCONs from discovery or the bridge:
- Task bottleneck: Call `get_tasks_performance` to find which specific task in a job is the bottleneck.
- What changed? Call `get_change_timeline` -- this is your most powerful tool. It returns a unified timeline of:
  - Query text changes (schema modifications, new JOINs, filter changes)
  - Volume shifts (row count spikes/drops)
  - Airflow task failures
  - dbt model failures

  All in one call. Look for correlations: "query changed on day X, runtime doubled on day X+1."
- Why are queries failing? Call `get_query_rca` to get root cause analysis:
  - Failed queries: errors, timeouts, permission issues
  - Futile queries: queries that run but produce no useful output
  - Patterns are pre-computed -- the tool groups failures by cause
- Is latency degrading? Call `get_query_latency_distribution` to see the trend:
  - Compare p50 vs p95 -- if p95 >> p50 (>5x), the problem is outlier queries
  - Look for step-changes in latency (sudden increase = regression)
- Trace impact: Call `get_asset_lineage` with `direction="DOWNSTREAM"` to see what's affected by a slow table, or `direction="UPSTREAM"` to find what feeds it.
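The two latency checks above can be sketched as pure functions over the latency series. The 5x outlier ratio comes from the text; the 2x step-change factor is an illustrative assumption, as is the series shape returned by `get_query_latency_distribution`:

```python
def diagnose_latency(p50: list[float], p95: list[float],
                     outlier_ratio: float = 5.0,
                     step_factor: float = 2.0) -> list[str]:
    """Flag outlier-driven latency (p95 >> p50) and step-change regressions."""
    findings = []
    # Outlier check: latest p95 far above latest p50 points at outlier queries.
    if p95[-1] > outlier_ratio * p50[-1]:
        findings.append("outliers: latest p95 is more than 5x p50")
    # Step-change check: a sudden jump in p50 between adjacent points.
    for prev, cur in zip(p50, p50[1:]):
        if cur > step_factor * prev:
            findings.append(f"step-change: p50 jumped from {prev}s to {cur}s")
    return findings

# Hypothetical daily series (seconds) -- invented example values.
p50 = [10.0, 10.5, 11.0, 24.0, 25.0]
p95 = [30.0, 31.0, 33.0, 140.0, 150.0]
findings = diagnose_latency(p50, p95)
```

A step-change date is a good pivot back into `get_change_timeline`: look for a query or volume change on or just before that day.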
### Step 5: Present findings
Structure your response as:
- Problem summary: What's slow and by how much (with exact numbers from tools)
- Root cause: What changed or what's causing the issue
- Impact: What downstream systems are affected
- Recommendations: Specific actions to fix the issue
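The four-part structure can be rendered mechanically; in this sketch all example values are invented placeholders, and the markdown layout is one option, not a required format:

```python
def format_findings(problem: str, root_cause: str,
                    impact: str, recommendations: list[str]) -> str:
    """Render the four-part findings structure as a markdown summary."""
    recs = "\n".join(f"- {r}" for r in recommendations)
    return (f"**Problem summary**: {problem}\n"
            f"**Root cause**: {root_cause}\n"
            f"**Impact**: {impact}\n"
            f"**Recommendations**:\n{recs}")

# Invented placeholder values -- not real tool output.
report = format_findings(
    problem="daily_revenue avg runtime rose from 22.5s to 48.1s over 7 days",
    root_cause="query text changed on day X; a new JOIN was added",
    impact="3 downstream dashboards refresh late",
    recommendations=["review the added JOIN", "backfill the delayed tables"],
)
```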
## Important rules
- Quote tool numbers exactly. If a tool returns "1282 runs, avg 22.5s", say exactly that. Never round, estimate, or fabricate numbers.
- Always compare to baselines. Use 7-day trend data (`runDurationTrend7d`) to distinguish regressions from normal variance. Flag if the trend data has less than 0.1 confidence.
- Stop when you have a root cause. 3-7 tool calls is typical. More than 10 means you're over-investigating.
- Read vs write queries: When the user asks about "reads" or "read queries", filter with `query_type="read"`. When they ask about "writes", use `query_type="write"`. Do NOT mix them.
- Never expose MCONs, UUIDs, or internal identifiers to the user. Use human-readable names.
- Cross-platform: This skill works across Airflow, dbt, and Databricks. Note which platform each finding comes from.
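The baseline rule can be made mechanical. This sketch assumes, per the discovery step, that a negative `runDurationTrend7d` means runtimes are degrading, and that a confidence value is reported alongside the trend:

```python
def classify_trend(trend: float, confidence: float,
                   min_confidence: float = 0.1) -> str:
    """Gate trend conclusions on confidence, per the baseline rule above."""
    if confidence < min_confidence:
        # Below the 0.1 threshold: flag rather than conclude.
        return "low-confidence trend -- flag to the user, do not conclude"
    if trend < 0:
        return "regression: runtimes worsening vs 7-day baseline"
    return "within normal variance"
```

For example, `classify_trend(-0.3, 0.05)` refuses to call a regression even though the trend is negative, because the confidence is too low.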