mc-agent-toolkit · monte-carlo-performance-diagnosis

Install

Source · Clone the upstream repo:
git clone https://github.com/monte-carlo-data/mc-agent-toolkit

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/monte-carlo-data/mc-agent-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/performance-diagnosis" ~/.claude/skills/monte-carlo-data-mc-agent-toolkit-monte-carlo-performance-diagnosis && rm -rf "$T"

Manifest: skills/performance-diagnosis/SKILL.md

Source content

Monte Carlo Performance Diagnosis Skill

This skill helps diagnose data pipeline performance issues using Monte Carlo's cross-platform observability data. It works across Airflow, dbt, Databricks, and warehouse query engines to find bottlenecks, detect regressions, and identify root causes.

Reference files live next to this skill file. Use the Read tool (not MCP resources) to access them:

  • Tiered investigation approach: references/investigation-tiers.md (relative to this file)
  • Query analysis patterns: references/query-analysis.md (relative to this file)
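
If you want to confirm the reference files landed where the Read tool expects them, here is a minimal sketch; the path assumes the Claude Code install command above, so adjust it if the skill lives elsewhere:

  from pathlib import Path

  # Path assumes the install command above; adjust if the skill was installed elsewhere.
  skill_dir = Path.home() / ".claude/skills/monte-carlo-data-mc-agent-toolkit-monte-carlo-performance-diagnosis"
  for ref in ("references/investigation-tiers.md", "references/query-analysis.md"):
      print(ref, "->", "found" if (skill_dir / ref).is_file() else "missing")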

When to activate this skill

Activate when the user:

  • Asks about slow pipelines, jobs, or queries
  • Wants to find expensive or costly queries
  • Mentions performance regressions or degradation
  • Asks "why is this pipeline slow?" or "what's using the most compute?"
  • Wants to compare performance over time or find bottleneck tasks
  • Asks about failed or futile query patterns

When NOT to activate this skill

Do not activate when the user is:

  • Investigating data quality issues (use the prevent skill)
  • Looking at storage costs (use the storage-cost-analysis skill)
  • Creating monitors (use the monitoring-advisor skill)
  • Just querying data or exploring table contents

Prerequisites

The following MCP tools must be available (connect to Monte Carlo's MCP server):

Discovery tools (Tier 1):

  • get_jobs_performance -- find slow/failing jobs across Airflow, dbt, Databricks
  • get_top_slow_queries -- find slowest query groups by total runtime

Bridge tool:

  • get_tables_for_job -- convert job MCONs to table MCONs

Diagnosis tools (Tier 2):

  • get_tasks_performance -- drill into a job's individual tasks
  • get_change_timeline -- unified timeline of query changes, volume shifts, Airflow/dbt failures
  • get_query_rca -- root cause analysis for failed/futile queries
  • get_query_latency_distribution -- latency trend over time
  • get_asset_lineage -- trace upstream/downstream impact

Supporting tools:

  • get_warehouses -- list available warehouses
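
A minimal availability check is sketched below. It assumes the MCP Python SDK (the mcp package) over Streamable HTTP; the endpoint URL and bearer token are placeholders, since connection details for Monte Carlo's MCP server are not specified here.

  import asyncio
  from mcp import ClientSession
  from mcp.client.streamable_http import streamablehttp_client

  REQUIRED = {
      "get_jobs_performance", "get_top_slow_queries", "get_tables_for_job",
      "get_tasks_performance", "get_change_timeline", "get_query_rca",
      "get_query_latency_distribution", "get_asset_lineage", "get_warehouses",
  }

  async def check_tools(url: str, token: str) -> None:
      # URL and auth header are placeholders -- use whatever Monte Carlo's MCP server expects.
      async with streamablehttp_client(url, headers={"Authorization": f"Bearer {token}"}) as (read, write, _):
          async with ClientSession(read, write) as session:
              await session.initialize()
              available = {tool.name for tool in (await session.list_tools()).tools}
              missing = REQUIRED - available
              print("missing tools:", ", ".join(sorted(missing)) or "none")

  # asyncio.run(check_tools("https://<monte-carlo-mcp-endpoint>", "<api-token>"))  # placeholders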

Workflow

Step 1: Identify the scope

Determine what the user wants to investigate:

  • Specific job/pipeline: User mentions a job name or pipeline
  • Specific table: User mentions a table that's slow to update
  • General discovery: User wants to find what's slow

Call get_warehouses to list available warehouses. Match the user's context to a warehouse.
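
A small helper for the matching step, as a sketch only; it assumes you have extracted a list of human-readable warehouse names from the get_warehouses result:

  def pick_warehouse(warehouse_names: list[str], user_text: str) -> str | None:
      # Case-insensitive match of a warehouse name mentioned in the user's request.
      text = user_text.lower()
      for name in warehouse_names:
          if name.lower() in text:
              return name
      return None  # no match or ambiguous -- ask the user which warehouse they mean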

Step 2: Tier 1 -- Discovery

If you don't have specific MCONs to investigate, start with discovery:

  1. Find slow jobs: Call get_jobs_performance with optional integration_type filter (AIRFLOW, DATABRICKS, DBT) if the user specifies a platform.
    • Results include: job name, average duration, trend (7-day), run count, failure rate
    • Look for: high avgDuration, negative runDurationTrend7d, high failure rates
  2. Find expensive queries: Call get_top_slow_queries with optional warehouse_id and query_type ("read" for SELECTs, "write" for INSERT/CREATE/MERGE).
    • Results include: query hash, total runtime, average runtime, run count
    • Look for: queries with high total runtime or high individual execution time
Present the top findings to the user before drilling deeper. A typical investigation needs only 3-7 tool calls.

If both discovery tools return no results: Tell the user no performance issues were found in the current time window. Suggest broadening the scope (different warehouse, longer time range, or a different platform filter).
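
A sketch of the two discovery calls above, assuming an initialized MCP ClientSession (see the availability check under Prerequisites). Parameter names come from this skill; the values are illustrative only:

  async def tier1_discovery(session, warehouse_id: str | None = None):
      # Slow/failing jobs; integration_type is optional (AIRFLOW, DATABRICKS, DBT).
      jobs = await session.call_tool("get_jobs_performance", {"integration_type": "DBT"})
      # Slowest query groups; query_type "write" covers INSERT/CREATE/MERGE.
      args = {"query_type": "write"}
      if warehouse_id:
          args["warehouse_id"] = warehouse_id
      queries = await session.call_tool("get_top_slow_queries", args)
      return jobs, queries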

Step 3: Bridge -- Job to Tables

After Tier 1 identifies problematic jobs, convert to table MCONs:

Call get_tables_for_job(job_mcon=..., integration_type=...) using the integration_type from the job performance results.

This gives you the table MCONs needed for Tier 2 investigation.
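
As a sketch, the bridge call for one problematic job found in Tier 1; the session and argument names follow the same assumptions as the Tier 1 sketch, and integration_type must be the value reported for that job:

  async def job_to_tables(session, job_mcon: str, integration_type: str):
      # Bridge: convert one problematic job into the table MCONs needed for Tier 2.
      return await session.call_tool(
          "get_tables_for_job",
          {"job_mcon": job_mcon, "integration_type": integration_type},
      )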

Step 4: Tier 2 -- Diagnosis

Now drill into root causes using the MCONs from discovery or the bridge:

  1. Task bottleneck: Call get_tasks_performance to find which specific task in a job is the bottleneck.
  2. What changed? Call get_change_timeline -- this is your most powerful tool. It returns a unified timeline of:
    • Query text changes (schema modifications, new JOINs, filter changes)
    • Volume shifts (row count spikes/drops)
    • Airflow task failures
    • dbt model failures
    All in one call. Look for correlations: "query changed on day X, runtime doubled on day X+1."
  3. Why are queries failing? Call get_query_rca to get root cause analysis:
    • Failed queries: errors, timeouts, permission issues
    • Futile queries: queries that run but produce no useful output
    • Patterns are pre-computed -- the tool groups failures by cause
  4. Is latency degrading? Call get_query_latency_distribution to see the trend:
    • Compare p50 vs p95 -- if p95 >> p50 (>5x), the problem is outlier queries
    • Look for step-changes in latency (sudden increase = regression)
  5. Trace impact: Call get_asset_lineage with direction="DOWNSTREAM" to see what's affected by a slow table, or direction="UPSTREAM" to find what feeds it.
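
The latency heuristics in item 4 can be expressed as a small check. This is a sketch only: the chronological p50/p95 point shape and the 2x step-change threshold are assumptions, not part of the tool's contract:

  def diagnose_latency(points: list[dict]) -> str:
      # points: chronological latency samples such as {"p50": 3.1, "p95": 42.0} -- an assumed shape.
      latest = points[-1]
      if latest["p95"] > 5 * latest["p50"]:
          return "outliers: p95 is more than 5x p50, so a few slow executions dominate"
      p50s = [p["p50"] for p in points]
      if p50s[-1] > 2 * min(p50s):  # 2x threshold is illustrative, not from the tool
          return "regression: median latency has stepped up within the window"
      return "stable: no obvious latency degradation"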

Step 5: Present findings

Structure your response as:

  1. Problem summary: What's slow and by how much (with exact numbers from tools)
  2. Root cause: What changed or what's causing the issue
  3. Impact: What downstream systems are affected
  4. Recommendations: Specific actions to fix the issue

Important rules

  • Quote tool numbers exactly. If a tool returns "1282 runs, avg 22.5s", say exactly that. Never round, estimate, or fabricate numbers.
  • Always compare to baselines. Use 7-day trend data (runDurationTrend7d) to distinguish regressions from normal variance. Flag if trend data has less than 0.1 confidence.
  • Stop when you have a root cause. 3-7 tool calls is typical. More than 10 means you're over-investigating.
  • Read vs write queries: When the user asks about "reads" or "read queries", filter with query_type="read". When they ask about "writes", use query_type="write". Do NOT mix them.
  • Never expose MCONs, UUIDs, or internal identifiers to the user. Use human-readable names.
  • Cross-platform: This skill works across Airflow, dbt, and Databricks. Note which platform each finding comes from.
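
For the baseline rule, a one-line guard keeps low-confidence trends from being reported as regressions. This is a sketch only: runDurationTrend7d comes from this skill, but the confidence field name is an assumption to check against the actual tool output:

  def trend_is_trustworthy(job: dict) -> bool:
      # Only report a runDurationTrend7d regression when its confidence is at least 0.1.
      # "trendConfidence" is an assumed field name; verify it against the real response.
      return job.get("trendConfidence", 0.0) >= 0.1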