# mc-agent-toolkit: monte-carlo-monitoring-advisor
Analyze data coverage, create monitors for warehouse tables and AI agents. Covers coverage gaps, use-case analysis, data monitor creation, and agent observability.
```bash
git clone https://github.com/monte-carlo-data/mc-agent-toolkit
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/monte-carlo-data/mc-agent-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/monitoring-advisor" ~/.claude/skills/monte-carlo-data-mc-agent-toolkit-monte-carlo-monitoring-advisor && rm -rf "$T"
```
`skills/monitoring-advisor/SKILL.md`

# Monte Carlo Monitoring Advisor Skill
This skill handles all monitoring requests -- coverage analysis, data monitor creation, and AI agent monitoring. It routes to the right reference file based on the user's intent.
Reference files live next to this skill file. Use the Read tool (not MCP resources) to access them:
- Data monitor creation procedure: `references/data-monitor-creation.md` (relative to this file)
- Agent monitor creation procedure: `references/agent-monitor-creation.md` (relative to this file)
- Per-type references: `references/data-*.md` and `references/agent-*.md` (relative to this file)
## When to activate this skill
Activate when the user:
- Asks about monitoring coverage, data coverage, or coverage gaps
- Wants to understand what's monitored vs. not in their warehouse
- Asks about use cases, use-case criticality, or use-case analysis
- Wants to explore their data estate and find what needs monitoring
- Says things like "what should I monitor?", "where are my coverage gaps?", "show me my use cases"
- Asks about unmonitored tables with anomalies or importance-based prioritization
- Asks to create, add, or set up a monitor (e.g. "add a monitor for...", "create a freshness check on...", "set up validation for...")
- Mentions monitoring a specific table, field, or metric
- Wants to check data quality rules or enforce data contracts
- Asks about monitoring options for a table or dataset
- Requests monitors-as-code YAML generation
- Wants to add monitoring after new transformation logic (when the prevent skill is not active)
- Asks about monitoring AI agents, agent latency, agent token usage, or agent quality
- Wants to set up alerts on agent behavior or execution patterns
- Asks about investigating agent traces or conversations
- Says things like "monitor my agent", "track agent latency", "alert on agent errors"
- Asks about agent evaluation monitors, trajectory monitors, or validation monitors
- Mentions agent observability or agent monitoring
## When NOT to activate this skill
Do not activate when the user is:
- Just querying data or exploring table contents
- Triaging or responding to active alerts (use the prevent skill's Workflow 3)
- Running impact assessments before code changes (use the prevent skill's Workflow 4)
- Asking about existing monitor configuration (use `get_monitors` directly)
- Editing or deleting existing monitors
## Prerequisites

- Required: Monte Carlo MCP server (`monte-carlo-mcp`) must be configured and authenticated
- Optional: A database MCP server (Snowflake, BigQuery, Redshift, Databricks) for SQL profiling of table usage patterns
## Available MCP tools

All tools are available via the `monte-carlo` MCP server.
### Coverage and discovery tools

| Tool | Purpose |
|---|---|
| `get_warehouses` | List accessible warehouses (needed first -- other tools require the selected warehouse) |
| `get_use_cases` | List use cases with criticality, descriptions, table counts, precomputed tag names |
| `get_use_case_table_summary` | Criticality distribution (HIGH/MEDIUM/LOW table counts) for a use case |
| `get_use_case_tables` | Paginated tables with criticality, golden-table status, MCONs |
| `get_monitors` | Check monitoring status on specific tables via `mcons` filter |
| `get_asset_lineage` | Upstream/downstream dependencies for tables (takes MCONs + direction) |
| `get_audiences` | List notification audiences |
| `get_unmonitored_tables_with_anomalies` | Tables with muted OOTB anomalies but no monitors (takes ISO 8601 time range) |
| `search` | Find tables by name; supports `is_monitored` filter |
| `get_table` | Table details, fields, stats, domain membership |
| `get_queries_for_table` | Query logs for a table (source/destination) |
|  | Available metrics per field type for a warehouse |
|  | List Monte Carlo domains |
|  | Available validation rule types |
### Data monitor creation tools

| Tool | Purpose |
|---|---|
| `create_table_monitor_mac` | Generate table monitor YAML (dry-run) |
|  | Generate metric monitor YAML (dry-run) |
|  | Generate validation monitor YAML (dry-run) |
|  | Generate custom SQL monitor YAML (dry-run) |
|  | Generate comparison monitor YAML (dry-run) |
### Agent monitoring tools

| Tool | Purpose |
|---|---|
|  | List AI agents -- returns agent names, trace table MCONs, source types |
|  | Retrieve recent LLM interactions/conversations for an agent |
|  | Inspect execution traces and span trees |
|  | Create monitors for quantitative span-level metrics |
|  | Create monitors for LLM-evaluated quality metrics |
|  | Create trajectory monitors for execution pattern alerts |
|  | Create validation monitors for logical assertions |
## Routing
When the user's request comes in, determine which workflow to follow:
| User intent | Workflow |
|---|---|
| Coverage analysis, use-case exploration, "what should I monitor?" | Coverage workflow (below) |
| Create a specific data monitor for a known table | Read `references/data-monitor-creation.md` and follow its procedure |
| Monitor AI agents, agent latency, agent quality, agent traces | Read `references/agent-monitor-creation.md` and follow its procedure |
| Coverage analysis leads to monitor creation | Complete coverage workflow, then read `references/data-monitor-creation.md` for creation |
When reading reference files, always use the Read tool with the path relative to this skill file.
## Coverage workflow
This is the primary flow when the user asks about monitoring coverage, coverage gaps, or what to monitor.
### Step 1: Discover warehouses

Call `get_warehouses` to list all accessible warehouses.
- If one warehouse: select it automatically, proceed to Step 2.
- If multiple warehouses: present warehouse names (never UUIDs) and ask the user which one to explore.
### Step 2: Discover use cases

Call `get_use_cases(warehouse_id=<selected>)` to discover use cases for the chosen warehouse (a call sketch follows the list below).
- If use cases exist --> proceed to the Use-case exploration (below).
- If no use cases --> proceed to the Importance-based fallback (below).
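For illustration, a minimal sketch of these first two calls in the MCP `tools/call` argument shape; the warehouse value is whatever the user selected from the `get_warehouses` result, and no other arguments are assumed here:

```json
[
  { "name": "get_warehouses", "arguments": {} },
  { "name": "get_use_cases", "arguments": { "warehouse_id": "<selected-warehouse-id>" } }
]
```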
### Step 3: Check for database MCP (optional)

Check if the user has a database MCP server available by looking for tools containing `snowflake`, `bigquery`, `redshift`, or `databricks` in the tool list. If found, note it for the SQL profiling step later. If not found, skip SQL profiling gracefully.
## Use-case exploration
This is the primary flow when use cases are defined.
### Present use cases
- Sort by criticality: HIGH before MEDIUM before LOW.
- For each use case, show the description and explain the reasoning for its criticality level so the user understands why it matters.
- Call `get_use_case_tables` with `golden_tables_only=true` and mention specific golden-table names as concrete examples (a call sketch follows this list). Golden tables are the last layer in the warehouse -- they feed ML models, dashboards, and reports. Explain this when relevant.
- Use `get_asset_lineage` to explain how tables in a use case are connected and why certain tables are important (e.g. a golden table with many upstream dependencies).
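A minimal sketch of the golden-table call, assuming the same `tools/call` shape; `golden_tables_only` comes from this skill file, while the name of the use-case argument shown here is an assumption to verify against the tool's schema:

```json
{
  "name": "get_use_case_tables",
  "arguments": {
    "use_case": "<use-case-from-get_use_cases>",
    "golden_tables_only": true
  }
}
```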
### Analyze coverage

- Call `get_use_case_table_summary` to show how many tables exist at each criticality level (HIGH / MEDIUM / LOW) for the use case.
- Call `get_use_case_tables` to obtain table MCONs, then call `get_monitors(mcons=[...])` to report how many are already monitored vs. not (a call sketch follows this list).
- Ask the user which criticality scope they prefer:
  - HIGH only -- monitor only the most critical tables
  - MEDIUM + HIGH -- broader coverage
  - ALL -- full coverage including LOW-criticality tables
- You may suggest covering multiple use cases in one session.
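A minimal sketch of the monitored-vs-not check; the `mcons` argument is documented above, while the use-case argument name is an assumption:

```json
[
  { "name": "get_use_case_tables", "arguments": { "use_case": "<use-case>" } },
  { "name": "get_monitors", "arguments": { "mcons": ["<mcon-1>", "<mcon-2>", "<mcon-3>"] } }
]
```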
### Identify coverage gaps with anomaly data

Use `get_unmonitored_tables_with_anomalies` to discover tables that are not monitored but already have muted out-of-the-box anomalies. This reveals real coverage gaps -- places where Monte Carlo detected data issues but no monitor was configured to alert anyone. A call sketch follows the list below.
- Call it with a recent time window (e.g. last 7-30 days) using ISO 8601 timestamps.
- Results are ranked by importance score -- the most critical gaps appear first.
- Each result includes a sample of anomaly events showing what types of issues were detected (freshness, volume, schema changes).
- Use this to prioritize which unmonitored tables to cover first -- a table with recent anomalies is a stronger candidate than one with no activity.
- Cross-reference with use-case data: if an unmonitored table with anomalies belongs to a critical use case, escalate its priority.
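A minimal sketch with a recent 30-day window; the tool takes an ISO 8601 time range, but the exact argument names (`start_time`/`end_time` here) are assumptions to check against the tool's schema:

```json
{
  "name": "get_unmonitored_tables_with_anomalies",
  "arguments": {
    "start_time": "2025-05-01T00:00:00Z",
    "end_time": "2025-05-31T00:00:00Z"
  }
}
```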
## Importance-based fallback
When no use cases are defined, fall back to importance-based table discovery.
- Find unmonitored tables: Use `search(query="", is_monitored=false)` to find unmonitored tables sorted by importance (a call sketch follows this list).
- Find tables with anomalies: Use `get_unmonitored_tables_with_anomalies` with a recent time window (last 14-30 days) to find tables with recent anomalies but no monitors.
- Inspect top candidates: Use `get_table` to check table details, fields, and stats for the most important unmonitored tables.
- Understand criticality via lineage: Use `get_asset_lineage` to understand which tables are most connected -- tables with many downstream dependencies are higher priority.
- Prioritize: Rank candidates by importance score and anomaly activity. Present the top candidates to the user with reasoning.
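A minimal sketch of the unmonitored-table search; both arguments appear in this skill file:

```json
{
  "name": "search",
  "arguments": {
    "query": "",
    "is_monitored": false
  }
}
```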
## SQL profiling (optional)
If a database MCP server was detected in Step 3 of the coverage workflow:
- Call `get_queries_for_table` to see recent query patterns on candidate tables.
- Use the database MCP tools (e.g. `snowflake_query`, `bigquery_query`) to profile table usage -- identify which tables are queried most frequently, which columns are used in JOINs and WHERE clauses (a sketch follows this list).
- Use this information to refine monitor suggestions -- heavily-queried tables with no monitors are high-priority gaps.
If no database MCP is available, skip this step entirely. Do not ask the user to configure one.
## Pre-creation context (coverage-driven)
When coverage analysis leads to monitor creation, gather this context before reading the creation reference file:
- Call `get_audiences` to list available notification audiences. Ask the user which audience they want notifications sent to (a call sketch follows this list).
- Ask whether the monitor should be created as a DRAFT or active.
- When passing `audiences` or `failure_audiences`, use the audience name/label (not UUID).
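A minimal sketch of the audience lookup and of how the chosen name (not a UUID) is later passed to a creation tool; only the `audiences` argument is shown for the second call, and the full argument set is covered in the next section:

```json
[
  { "name": "get_audiences", "arguments": {} },
  { "name": "create_table_monitor_mac", "arguments": { "audiences": ["<audience-name-chosen-by-user>"] } }
]
```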
## Use-case tag monitors

The most common output of coverage analysis is a table monitor scoped by use-case tags via `create_table_monitor_mac`. The `asset_selection` parameter uses this structure (a fuller call sketch follows the rules below):

```json
{
  "databases": ["<database_name>"],
  "schemas": ["<schema_name>"],
  "filters": [
    {
      "type": "TABLE_TAG",
      "tableTags": ["<tag_key>:<criticality>"],
      "tableTagsOperator": "HAS_ANY"
    }
  ]
}
```
Rules:
- Filter `type` is always `TABLE_TAG` for use-case monitors.
- `tableTagsOperator` should be `HAS_ANY`.
- Each entry in `tableTags` is `"<tag_key>:<value>"` where the tag key is the precomputed tag name from `get_use_cases` output and the value is the criticality level in lowercase (`high`, `medium`, `low`).
- To monitor only HIGH-criticality tables: `["tag_name:high"]`
- To monitor MEDIUM + HIGH: `["tag_name:high", "tag_name:medium"]`
- To monitor ALL: `["tag_name:high", "tag_name:medium", "tag_name:low"]`
## Monitor description guidelines

Write a clear, meaningful `description` that explains what the monitor covers and why. The backend auto-generates the monitor name -- you cannot control it, but the description is what users see.
- Bad: "Data Quality Monitoring - HIGH criticality table monitor"
- Good: "Monitor HIGH criticality tables in the Revenue Reporting use case to catch issues before they affect dashboards and financial reports."
The description should mention the criticality scope, the use case name, and a brief reason why this monitoring matters.
## Transient and truncate-and-reload tables
Some tables show 0 rows when queried directly but have recent write activity in Monte Carlo metadata. These are transient tables -- fully replaced on each pipeline run (truncate-and-reload pattern). Recognize this pattern early to avoid wasting time querying empty tables.
Signs of a transient table:

- `get_table` shows a recent `last_write` timestamp and high read/write activity
- Direct SQL query returns 0 rows or all-NULL timestamp columns
- Monte Carlo detected freshness anomalies (the table stayed empty longer than expected between loads)
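To confirm the pattern, compare Monte Carlo metadata against a direct query -- a sketch assuming a `snowflake_query`-style tool and that `get_table` takes an MCON (the argument names shown are assumptions). If the first call shows a recent `last_write` but the second returns 0 rows, treat the table as transient and skip further direct profiling:

```json
[
  { "name": "get_table", "arguments": { "mcon": "<table-mcon>" } },
  { "name": "snowflake_query", "arguments": { "query": "SELECT COUNT(*) FROM <database>.<schema>.<table>" } }
]
```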
## Graceful degradation
Handle missing or unavailable tools gracefully:
| Scenario | Behavior |
|---|---|
| No use cases defined | Fall back to importance-based discovery |
| No database MCP available | Skip SQL profiling, rely on MC tools only |
| `get_unmonitored_tables_with_anomalies` returns empty | Note that no recent anomalies were found; proceed with use-case or importance-based prioritization |
| `get_use_case_tables` returns no tables | Note the use case has no tables; suggest exploring other use cases |
| `get_audiences` returns empty | Inform user no audiences are configured; monitors can still be created without notification routing |
| User has no warehouses | Inform user that no warehouses are accessible; they may need to check their Monte Carlo permissions |
Never error out or stop the conversation because one tool returned empty results. Explain what happened and offer the next best path.
## Rules
- Never expose UUIDs, MCONs, or internal identifiers to the user -- always use human-readable names for warehouses, audiences, use cases, and tables. Keep internal identifiers for tool calls only.
- When the user asks about relationships between tables, use `get_asset_lineage` to fetch upstream/downstream connections and explain the data flow.
- Be concise but thorough. Use bullet points and tables for clarity.
- Always use ISO 8601 format for datetime values in tool calls.
- Never reformat YAML values returned by creation tools.
- When passing `audiences` or `failure_audiences` to monitor creation tools, use the audience name/label (not UUID). The API accepts audience names.