# claude-code-plugins: snowflake-incident-runbook

## Install

Clone the upstream repo:

```shell
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
```

Or install the skill directly into `~/.claude/skills/` for Claude Code:

```shell
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/plugins/saas-packs/snowflake-pack/skills/snowflake-incident-runbook" \
       ~/.claude/skills/jeremylongshore-claude-code-plugins-snowflake-incident-runbook \
  && rm -rf "$T"
```

Manifest: `plugins/saas-packs/snowflake-pack/skills/snowflake-incident-runbook/SKILL.md`
# Snowflake Incident Runbook

## Overview

Rapid incident response procedures for Snowflake infrastructure, pipeline failures, and query issues.

## Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Complete outage | < 15 min | All queries failing, auth broken |
| P2 | Degraded service | < 1 hour | High latency, task failures |
| P3 | Minor impact | < 4 hours | Snowpipe delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps, cost anomalies |
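For automation (paging, dashboards), the severity table above can be encoded as a small lookup. The sketch below mirrors the response-time targets; the names and structure are illustrative, not part of the pack.

```python
from datetime import timedelta

# Response-time targets from the severity table above.
# P4's "next business day" is approximated as 1 day here.
SEVERITY_SLA = {
    "P1": timedelta(minutes=15),  # complete outage
    "P2": timedelta(hours=1),     # degraded service
    "P3": timedelta(hours=4),     # minor impact
    "P4": timedelta(days=1),      # no user impact
}

def response_deadline(severity, detected_at):
    """Return the latest acceptable first-response time for an incident."""
    return detected_at + SEVERITY_SLA[severity]
```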
## Quick Triage (First 5 Minutes)

### Step 1: Is Snowflake Itself Down?

```shell
# Check the Snowflake status page
curl -s https://status.snowflake.com/api/v2/summary.json | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(f\"Status: {data['status']['description']}\")
for c in data['components']:
    if c['status'] != 'operational':
        print(f\"  DEGRADED: {c['name']} - {c['status']}\")
"
```
### Step 2: Can We Connect?

```sql
-- Quick connectivity test
SELECT CURRENT_TIMESTAMP(), CURRENT_ACCOUNT(), CURRENT_REGION();
-- If this fails, the issue is connectivity/auth, not query logic
```
### Step 3: What's Failing?

```sql
-- Recent failures (last 30 minutes)
-- Note: ACCOUNT_USAGE views can lag up to ~45 minutes; for near-real-time
-- results, use the INFORMATION_SCHEMA.QUERY_HISTORY() table function instead
SELECT
    error_code,
    error_message,
    COUNT(*) AS occurrences,
    MIN(start_time) AS first_seen,
    MAX(start_time) AS last_seen
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE execution_status = 'FAIL'
  AND start_time >= DATEADD(minutes, -30, CURRENT_TIMESTAMP())
GROUP BY error_code, error_message
ORDER BY occurrences DESC;

-- Failed tasks
SELECT name, state, error_message, scheduled_time, completed_time
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
    SCHEDULED_TIME_RANGE_START => DATEADD(hours, -1, CURRENT_TIMESTAMP())
))
WHERE state = 'FAILED'
ORDER BY scheduled_time DESC;

-- Stale streams (data loss risk)
SHOW STREAMS;
-- Check the STALE column; if TRUE, the stream offset is beyond the retention period
```
## Decision Tree

```
Query failures?
├─ Auth errors (390100, 390144)
│    → Check credentials, key pair, network policy
├─ Object not found (002003)
│    → Wrong context? Permissions? Object dropped?
├─ Warehouse issues (000606)
│    → Warehouse suspended? Resource monitor hit?
├─ Timeout (100038)
│    → Query too slow? Warehouse too small?
└─ Snowflake platform issue (5xx, connectivity)
     → Check status.snowflake.com → enable fallback

Pipeline failures?
├─ Task failed
│    → Check TASK_HISTORY error_message
│    → Is source stream stale?
├─ Snowpipe not loading
│    → SYSTEM$PIPE_STATUS('pipe_name')
│    → Check S3 event notifications
└─ Dynamic table not refreshing
     → Check DYNAMIC_TABLE_REFRESH_HISTORY
```
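The query-failure branch of the tree can be expressed as a lookup from Snowflake error codes to a first action, which is handy for annotating alerts automatically. The error codes come from the tree above; the function and wording are illustrative.

```python
# First-action hints keyed by Snowflake error code, mirroring the
# query-failure branch of the decision tree above.
ERROR_ACTIONS = {
    "390100": "Auth error: check credentials, key pair, network policy",
    "390144": "Auth error: check credentials, key pair, network policy",
    "002003": "Object not found: wrong context, permissions, or dropped object",
    "000606": "Warehouse issue: suspended warehouse or resource monitor limit",
    "100038": "Timeout: query too slow or warehouse undersized",
}

def triage(error_code):
    """Map a failed query's error code to a first response action."""
    return ERROR_ACTIONS.get(
        error_code,
        "Unrecognized code: check status.snowflake.com for platform issues",
    )
```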
## Immediate Actions

### Authentication Failure (P1)

```sql
-- Check login failures
SELECT user_name, client_ip, error_code, error_message, event_timestamp
FROM SNOWFLAKE.ACCOUNT_USAGE.LOGIN_HISTORY
WHERE is_success = 'NO'
  AND event_timestamp >= DATEADD(minutes, -30, CURRENT_TIMESTAMP())
ORDER BY event_timestamp DESC;

-- If key pair issue, verify public key assignment
DESC USER svc_etl;  -- Check RSA_PUBLIC_KEY and RSA_PUBLIC_KEY_2
```
### Warehouse Suspended by Resource Monitor (P1)

```sql
-- Check resource monitor status
SHOW RESOURCE MONITORS;

-- Temporarily increase quota
ALTER RESOURCE MONITOR prod_monitor SET CREDIT_QUOTA = 3000;

-- Or switch the session to a different warehouse
USE WAREHOUSE BACKUP_WH;
```
### Pipeline Failure — Stale Stream (P1)

```sql
-- If a stream is stale, changes between the old offset and now are lost;
-- you must recreate the stream and backfill the gap.

-- Check stream status (look at the STALE column in the output)
SHOW STREAMS;

-- Recreate the stream
DROP STREAM IF EXISTS orders_stream;
CREATE STREAM orders_stream ON TABLE raw_orders;

-- Backfill from Time Travel
INSERT INTO dim_orders
SELECT * FROM raw_orders
  AT (TIMESTAMP => '<last_known_good_timestamp>'::TIMESTAMP_NTZ)
WHERE order_id NOT IN (SELECT order_id FROM dim_orders);
```
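Because the backfill INSERT is the error-prone step under incident pressure, it can help to template it. The helper below renders the Time Travel backfill statement from table and key names; the function is a sketch (the names shown are from the example above), and the timestamp must still be chosen manually.

```python
def backfill_sql(target, source, key, as_of_ts):
    """Render a Time Travel backfill INSERT, as used after recreating
    a stale stream. All identifiers are passed in by the caller."""
    return (
        f"INSERT INTO {target}\n"
        f"SELECT * FROM {source}\n"
        f"  AT (TIMESTAMP => '{as_of_ts}'::TIMESTAMP_NTZ)\n"
        f"WHERE {key} NOT IN (SELECT {key} FROM {target});"
    )
```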
### Rollback a Bad Deployment (P2)

```sql
-- Use Time Travel to restore the table to its pre-deployment state
CREATE OR REPLACE TABLE prod_dw.silver.users
  CLONE prod_dw.silver.users AT (TIMESTAMP => '2026-03-22 08:00:00'::TIMESTAMP_NTZ);

-- Or use UNDROP for accidentally dropped objects
UNDROP TABLE prod_dw.silver.users;
UNDROP SCHEMA prod_dw.silver;
UNDROP DATABASE prod_dw;

-- Suspend problematic tasks
ALTER TASK transform_orders SUSPEND;
```
## Communication Templates

**Internal (Slack):**

```
P1 INCIDENT: Snowflake [Category]
Status: INVESTIGATING
Impact: [Describe user/pipeline impact]
Current action: [What you're doing now]
Next update: [Time]
Incident commander: @[name]
```
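If updates are posted by an alerting hook rather than by hand, the template can be filled programmatically. The formatter below is a sketch using the fields above; the function name and parameters are illustrative.

```python
def incident_message(severity, category, impact, action, next_update, commander):
    """Fill the internal Slack incident template from the runbook."""
    return (
        f"{severity} INCIDENT: Snowflake {category}\n"
        f"Status: INVESTIGATING\n"
        f"Impact: {impact}\n"
        f"Current action: {action}\n"
        f"Next update: {next_update}\n"
        f"Incident commander: @{commander}"
    )
```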
**Postmortem Template:**

```markdown
## Incident: [Title]
**Date:** YYYY-MM-DD | **Duration:** X hours | **Severity:** P[1-4]

### Summary
[1-2 sentences]

### Timeline (UTC)
- HH:MM — [Event/detection]
- HH:MM — [Response action]
- HH:MM — [Resolution]

### Root Cause
[Technical explanation referencing specific error codes and query IDs]

### Impact
- Pipelines affected: N
- Data freshness delay: X hours
- Credit overage: Y credits

### Action Items
- [ ] [Preventive measure] — Owner — Due date
```
## Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Can't query ACCOUNT_USAGE | Missing privileges | Use ACCOUNTADMIN or grant IMPORTED PRIVILEGES |
| Time Travel expired | Past retention period | Cannot recover; increase retention proactively |
| Task won't resume | Dependency chain issue | Resume children first, then parent |
| Snowpipe backlog | S3 notification gap | Check SQS queue, run `ALTER PIPE <pipe> REFRESH` |
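On the "Time Travel expired" row: whether a restore point is still reachable depends on the table's `DATA_RETENTION_TIME_IN_DAYS` setting. A quick local sanity check before attempting a clone, as a sketch (function name is illustrative):

```python
from datetime import datetime, timedelta

def within_time_travel(restore_point, retention_days, now):
    """True if a Time Travel restore point is still within the
    table's data retention window."""
    return now - restore_point <= timedelta(days=retention_days)
```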
## Next Steps

For data governance, see snowflake-data-handling.