mc-agent-toolkit · monte-carlo-performance-diagnosis

Install

Source · Clone the upstream repo:
git clone https://github.com/monte-carlo-data/mc-agent-toolkit

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/monte-carlo-data/mc-agent-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/performance-diagnosis" ~/.claude/skills/monte-carlo-data-mc-agent-toolkit-monte-carlo-performance-diagnosis && rm -rf "$T"

Manifest: skills/performance-diagnosis/SKILL.md

Source content

Monte Carlo Performance Diagnosis Skill

This skill helps diagnose data pipeline performance issues using Monte Carlo's cross-platform observability data. It works across Airflow, dbt, Databricks, and warehouse query engines to find bottlenecks, detect regressions, and identify root causes.

Reference files live next to this skill file. Use the Read tool (not MCP resources) to access them:

  • Tiered investigation approach: references/investigation-tiers.md (relative to this file)
  • Query analysis patterns: references/query-analysis.md (relative to this file)
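
If you want to confirm the reference files landed where the Read tool expects them, here is a minimal sketch; the path assumes the Claude Code install command above, so adjust it if the skill lives elsewhere:

  from pathlib import Path

  # Path assumes the install command above; adjust if the skill was installed elsewhere.
  skill_dir = Path.home() / ".claude/skills/monte-carlo-data-mc-agent-toolkit-monte-carlo-performance-diagnosis"
  for ref in ("references/investigation-tiers.md", "references/query-analysis.md"):
      print(ref, "->", "found" if (skill_dir / ref).is_file() else "missing")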

When to activate this skill

Activate when the user:

  • Asks about slow pipelines, jobs, or queries
  • Wants to find expensive or costly queries
  • Mentions performance regressions or degradation
  • Asks "why is this pipeline slow?" or "what's using the most compute?"
  • Wants to compare performance over time or find bottleneck tasks
  • Asks about failed or futile query patterns

When NOT to activate this skill

Do not activate when the user is:

  • Investigating data quality issues (use the prevent skill)
  • Looking at storage costs (use the storage-cost-analysis skill)
  • Creating monitors (use the monitoring-advisor skill)
  • Just querying data or exploring table contents

Prerequisites

The following MCP tools must be available (connect to Monte Carlo's MCP server):

Discovery tools (Tier 1):

  • get_jobs_performance -- find slow/failing jobs across Airflow, dbt, Databricks
  • get_top_slow_queries -- find slowest query groups by total runtime

Bridge tool:

  • get_tables_for_job -- convert job MCONs to table MCONs

Diagnosis tools (Tier 2):

  • get_tasks_performance -- drill into a job's individual tasks
  • get_change_timeline -- unified timeline of query changes, volume shifts, Airflow/dbt failures
  • get_query_rca -- root cause analysis for failed/futile queries
  • get_query_latency_distribution -- latency trend over time
  • get_asset_lineage -- trace upstream/downstream impact

Supporting tools:

  • get_warehouses -- list available warehouses
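
A minimal availability check is sketched below. It assumes the MCP Python SDK (the mcp package) over Streamable HTTP; the endpoint URL and bearer token are placeholders, since connection details for Monte Carlo's MCP server are not specified here.

  import asyncio
  from mcp import ClientSession
  from mcp.client.streamable_http import streamablehttp_client

  REQUIRED = {
      "get_jobs_performance", "get_top_slow_queries", "get_tables_for_job",
      "get_tasks_performance", "get_change_timeline", "get_query_rca",
      "get_query_latency_distribution", "get_asset_lineage", "get_warehouses",
  }

  async def check_tools(url: str, token: str) -> None:
      # URL and auth header are placeholders -- use whatever Monte Carlo's MCP server expects.
      async with streamablehttp_client(url, headers={"Authorization": f"Bearer {token}"}) as (read, write, _):
          async with ClientSession(read, write) as session:
              await session.initialize()
              available = {tool.name for tool in (await session.list_tools()).tools}
              missing = REQUIRED - available
              print("missing tools:", ", ".join(sorted(missing)) or "none")

  # asyncio.run(check_tools("https://<monte-carlo-mcp-endpoint>", "<api-token>"))  # placeholders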

Workflow

Step 1: Identify the scope

Determine what the user wants to investigate:

  • Specific job/pipeline: User mentions a job name or pipeline
  • Specific table: User mentions a table that's slow to update
  • General discovery: User wants to find what's slow

Call get_warehouses to list available warehouses. Match the user's context to a warehouse.
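
A small helper for the matching step, as a sketch only; it assumes you have extracted a list of human-readable warehouse names from the get_warehouses result:

  def pick_warehouse(warehouse_names: list[str], user_text: str) -> str | None:
      # Case-insensitive match of a warehouse name mentioned in the user's request.
      text = user_text.lower()
      for name in warehouse_names:
          if name.lower() in text:
              return name
      return None  # no match or ambiguous -- ask the user which warehouse they mean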

Step 2: Tier 1 -- Discovery

If you don't have specific MCONs to investigate, start with discovery:

  1. Find slow jobs: Call get_jobs_performance with optional integration_type filter (AIRFLOW, DATABRICKS, DBT) if the user specifies a platform.
    • Results include: job name, average duration, trend (7-day), run count, failure rate
    • Look for: high avgDuration, negative runDurationTrend7d, high failure rates
  2. Find expensive queries: Call get_top_slow_queries with optional warehouse_id and query_type ("read" for SELECTs, "write" for INSERT/CREATE/MERGE).
    • Results include: query hash, total runtime, average runtime, run count
    • Look for: queries with high total runtime or high individual execution time
Present the top findings to the user before drilling deeper. A typical investigation needs only 3-7 tool calls.

If both discovery tools return no results: Tell the user no performance issues were found in the current time window. Suggest broadening the scope (different warehouse, longer time range, or a different platform filter).
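
A sketch of the two discovery calls above, assuming an initialized MCP ClientSession (see the availability check under Prerequisites). Parameter names come from this skill; the values are illustrative only:

  async def tier1_discovery(session, warehouse_id: str | None = None):
      # Slow/failing jobs; integration_type is optional (AIRFLOW, DATABRICKS, DBT).
      jobs = await session.call_tool("get_jobs_performance", {"integration_type": "DBT"})
      # Slowest query groups; query_type "write" covers INSERT/CREATE/MERGE.
      args = {"query_type": "write"}
      if warehouse_id:
          args["warehouse_id"] = warehouse_id
      queries = await session.call_tool("get_top_slow_queries", args)
      return jobs, queries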

Step 3: Bridge -- Job to Tables

After Tier 1 identifies problematic jobs, convert to table MCONs:

Call get_tables_for_job(job_mcon=..., integration_type=...) using the integration_type from the job performance results.

This gives you the table MCONs needed for Tier 2 investigation.
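
As a sketch, the bridge call for one problematic job found in Tier 1; the session and argument names follow the same assumptions as the Tier 1 sketch, and integration_type must be the value reported for that job:

  async def job_to_tables(session, job_mcon: str, integration_type: str):
      # Bridge: convert one problematic job into the table MCONs needed for Tier 2.
      return await session.call_tool(
          "get_tables_for_job",
          {"job_mcon": job_mcon, "integration_type": integration_type},
      )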

Step 4: Tier 2 -- Diagnosis

Now drill into root causes using the MCONs from discovery or the bridge:

  1. Task bottleneck: Call get_tasks_performance to find which specific task in a job is the bottleneck.
  2. What changed? Call get_change_timeline -- this is your most powerful tool. It returns a unified timeline of:
    • Query text changes (schema modifications, new JOINs, filter changes)
    • Volume shifts (row count spikes/drops)
    • Airflow task failures
    • dbt model failures
    All in one call. Look for correlations: "query changed on day X, runtime doubled on day X+1."
  3. Why are queries failing? Call get_query_rca to get root cause analysis:
    • Failed queries: errors, timeouts, permission issues
    • Futile queries: queries that run but produce no useful output
    • Patterns are pre-computed -- the tool groups failures by cause
  4. Is latency degrading? Call get_query_latency_distribution to see the trend:
    • Compare p50 vs p95 -- if p95 >> p50 (>5x), the problem is outlier queries
    • Look for step-changes in latency (sudden increase = regression)
  5. Trace impact: Call get_asset_lineage with direction="DOWNSTREAM" to see what's affected by a slow table, or direction="UPSTREAM" to find what feeds it.
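
The latency heuristics in item 4 can be expressed as a small check. This is a sketch only: the chronological p50/p95 point shape and the 2x step-change threshold are assumptions, not part of the tool's contract:

  def diagnose_latency(points: list[dict]) -> str:
      # points: chronological latency samples such as {"p50": 3.1, "p95": 42.0} -- an assumed shape.
      latest = points[-1]
      if latest["p95"] > 5 * latest["p50"]:
          return "outliers: p95 is more than 5x p50, so a few slow executions dominate"
      p50s = [p["p50"] for p in points]
      if p50s[-1] > 2 * min(p50s):  # 2x threshold is illustrative, not from the tool
          return "regression: median latency has stepped up within the window"
      return "stable: no obvious latency degradation"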

Step 5: Present findings

Structure your response as:

  1. Problem summary: What's slow and by how much (with exact numbers from tools)
  2. Root cause: What changed or what's causing the issue
  3. Impact: What downstream systems are affected
  4. Recommendations: Specific actions to fix the issue

Important rules

  • Quote tool numbers exactly. If a tool returns "1282 runs, avg 22.5s", say exactly that. Never round, estimate, or fabricate numbers.
  • Always compare to baselines. Use 7-day trend data (runDurationTrend7d) to distinguish regressions from normal variance. Flag if trend data has less than 0.1 confidence.
  • Stop when you have a root cause. 3-7 tool calls is typical. More than 10 means you're over-investigating.
  • Read vs write queries: When the user asks about "reads" or "read queries", filter with query_type="read". When they ask about "writes", use query_type="write". Do NOT mix them.
  • Never expose MCONs, UUIDs, or internal identifiers to the user. Use human-readable names.
  • Cross-platform: This skill works across Airflow, dbt, and Databricks. Note which platform each finding comes from.
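
For the baseline rule, a one-line guard keeps low-confidence trends from being reported as regressions. This is a sketch only: runDurationTrend7d comes from this skill, but the confidence field name is an assumption to check against the actual tool output:

  def trend_is_trustworthy(job: dict) -> bool:
      # Only report a runDurationTrend7d regression when its confidence is at least 0.1.
      # "trendConfidence" is an assumed field name; verify it against the real response.
      return job.get("trendConfidence", 0.0) >= 0.1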