Claude-code-plugins-plus-skills notion-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/notion-pack/skills/notion-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-notion-incident-runbook && rm -rf "$T"
manifest: plugins/saas-packs/notion-pack/skills/notion-incident-runbook/SKILL.md
Notion Incident Runbook
Overview
Rapid incident response procedures for Notion API failures. This runbook covers a structured triage flow (under 5 minutes), automated health checks against both status.notion.so and your own integration, a decision tree for classifying failures (Notion-side vs. integration-side), per-error-type mitigation with real `Client` code, cached fallback patterns, communication templates, and a postmortem structure.
Prerequisites
- Access to application monitoring dashboards and log aggregator
- `NOTION_TOKEN` environment variable set for diagnostic API calls
- `curl` and `jq` installed for quick CLI triage
- Python alternative: `notion-client` (`pip install notion-client`)
- Communication channels configured (Slack webhook, PagerDuty, etc.)
Instructions
Step 1: Quick Triage (Under 5 Minutes)
Run this diagnostic script to determine if the issue is Notion-side or integration-side:
```bash
#!/bin/bash
# notion-triage.sh: run at first alert
set -euo pipefail

: "${NOTION_TOKEN:?NOTION_TOKEN must be set}"

echo "=== Notion Incident Triage ==="
echo "Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# 1. Check Notion's public status page
echo -e "\n--- Notion Platform Status ---"
STATUS=$(curl -sf https://status.notion.so/api/v2/status.json \
  | jq -r '.status.description' 2>/dev/null || echo "UNREACHABLE")
echo "Notion Status: $STATUS"

INCIDENTS=$(curl -sf https://status.notion.so/api/v2/incidents/unresolved.json \
  | jq '.incidents | length' 2>/dev/null || echo "UNKNOWN")
echo "Active Incidents: $INCIDENTS"

if [ "$INCIDENTS" != "0" ] && [ "$INCIDENTS" != "UNKNOWN" ]; then
  echo "INCIDENT DETAILS:"
  # "|| true" keeps a failed detail fetch from aborting the script under set -e
  curl -sf https://status.notion.so/api/v2/incidents/unresolved.json \
    | jq -r '.incidents[] | "  - \(.name) (\(.status)): \(.incident_updates[0].body)"' \
    || true
fi

# 2. Test our integration authentication
# No -f here: with -f, curl exits non-zero on 4xx/5xx and a fallback echo would
# be appended to the code already printed by -w (giving e.g. "401000").
# On a connection failure, -w itself prints 000.
echo -e "\n--- Integration Auth Check ---"
AUTH_HTTP=$(curl -s -o /dev/null -w "%{http_code}" \
  https://api.notion.com/v1/users/me \
  -H "Authorization: Bearer ${NOTION_TOKEN}" \
  -H "Notion-Version: 2022-06-28" 2>/dev/null || true)
AUTH_HTTP=${AUTH_HTTP:-000}
echo "Auth HTTP Status: $AUTH_HTTP"

if [ "$AUTH_HTTP" = "200" ]; then
  BOT_NAME=$(curl -sf https://api.notion.com/v1/users/me \
    -H "Authorization: Bearer ${NOTION_TOKEN}" \
    -H "Notion-Version: 2022-06-28" | jq -r '.name')
  echo "Bot Name: $BOT_NAME"
fi

# 3. Test database query (if test DB configured)
echo -e "\n--- API Responsiveness ---"
if [ -n "${NOTION_TEST_DATABASE_ID:-}" ]; then
  QUERY_RESULT=$(curl -s -o /dev/null -w "%{http_code} %{time_total}s" \
    -X POST "https://api.notion.com/v1/databases/${NOTION_TEST_DATABASE_ID}/query" \
    -H "Authorization: Bearer ${NOTION_TOKEN}" \
    -H "Notion-Version: 2022-06-28" \
    -H "Content-Type: application/json" \
    -d '{"page_size": 1}' 2>/dev/null || true)
  echo "Database Query: ${QUERY_RESULT:-000 0.000s}"
else
  echo "NOTION_TEST_DATABASE_ID not set; skipping query test"
fi

# 4. Classification
echo -e "\n--- Triage Result ---"
if [ "$STATUS" != "All Systems Operational" ] && [ "$STATUS" != "UNREACHABLE" ]; then
  echo "CLASSIFICATION: Notion-side issue. Enable fallback mode."
elif [ "$AUTH_HTTP" = "401" ]; then
  echo "CLASSIFICATION: Token expired or revoked. Rotate immediately."
elif [ "$AUTH_HTTP" = "429" ]; then
  echo "CLASSIFICATION: Rate limited. Reduce concurrency."
elif [ "$AUTH_HTTP" = "000" ]; then
  echo "CLASSIFICATION: Network/DNS issue. Check firewall and DNS."
else
  echo "CLASSIFICATION: Integration-side issue. Check application logs."
fi
```
TypeScript — programmatic triage:
```typescript
import { Client, isNotionClientError, APIErrorCode } from '@notionhq/client';

async function triageNotionHealth(token: string): Promise<{
  classification: string;
  notionStatus: string;
  authStatus: string;
  latencyMs: number;
}> {
  // Check Notion status page
  let notionStatus = 'unknown';
  try {
    const res = await fetch('https://status.notion.so/api/v2/status.json');
    const data = (await res.json()) as { status: { description: string } };
    notionStatus = data.status.description;
  } catch {
    notionStatus = 'unreachable';
  }

  // Test our authentication
  const client = new Client({ auth: token, timeoutMs: 10_000 });
  const start = Date.now();
  let authStatus = 'unknown';
  let classification = 'unknown';

  try {
    await client.users.me({});
    authStatus = 'authenticated';
    classification = 'integration-side';
  } catch (error) {
    if (isNotionClientError(error)) {
      // Timeout errors carry no HTTP status, so read it defensively
      const httpStatus = 'status' in error ? error.status : 'n/a';
      authStatus = `${error.code} (HTTP ${httpStatus})`;
      switch (error.code) {
        case APIErrorCode.Unauthorized:
          classification = 'token-expired';
          break;
        case APIErrorCode.RateLimited:
          classification = 'rate-limited';
          break;
        case APIErrorCode.ServiceUnavailable:
          classification = 'notion-down';
          break;
        default:
          classification = 'api-error';
      }
    } else {
      authStatus = 'network-error';
      classification = 'network-issue';
    }
  }
  // Measure only the auth round trip, not the classification logic below
  const latencyMs = Date.now() - start;

  // A reported incident overrides the local classification. An unreachable
  // status page is not evidence of a Notion outage, so it is excluded here,
  // matching the shell script above.
  if (notionStatus !== 'All Systems Operational' && notionStatus !== 'unreachable') {
    classification = 'notion-side';
  }

  return { classification, notionStatus, authStatus, latencyMs };
}
```
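Comparing against the human-readable description string can break if the wording on the status page changes. The sketch below assumes the status endpoint returns the standard Statuspage v2 payload, in which `status.indicator` is `none` when nothing is wrong and `minor`, `major`, or `critical` otherwise. The names `StatusPayload` and `isNotionDegraded` are illustrative and not part of any SDK.

```typescript
// Assumed shape of the Statuspage v2 /api/v2/status.json payload; only the
// fields used here are declared.
interface StatusPayload {
  status: { indicator: string; description: string };
}

// Hypothetical helper: true when the status page reports any active impact.
function isNotionDegraded(payload: StatusPayload): boolean {
  return payload.status.indicator !== 'none';
}

// Sample payloads for illustration (not captured from a real incident)
const healthy: StatusPayload = {
  status: { indicator: 'none', description: 'All Systems Operational' },
};
const degraded: StatusPayload = {
  status: { indicator: 'major', description: 'Partial System Outage' },
};

console.log(isNotionDegraded(healthy));  // false
console.log(isNotionDegraded(degraded)); // true
```

If this check is adopted, the triage function could use it in place of the string comparison while still logging the description for humans.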
Step 2: Decision Tree and Mitigation
```text
Is status.notion.so showing an incident?
|
+-- YES --> Notion-side outage
|   +-- Enable cached/fallback mode
|   +-- Notify users of degraded service
|   +-- Monitor status page for resolution
|   +-- DO NOT restart or rotate tokens
|
+-- NO --> Our integration issue
    |
    +-- Auth returning 401?
    |   +-- YES --> Token expired or revoked
    |   |   +-- Regenerate at notion.so/my-integrations
    |   |   +-- Update secret manager (see below)
    |   |   +-- Restart application
    |   +-- NO --> Continue
    |
    +-- Getting 429 rate limits?
    |   +-- YES --> Exceeding 3 req/s average
    |   |   +-- Check for runaway loops or webhook storms
    |   |   +-- Reduce concurrency to 1
    |   |   +-- Add exponential backoff
    |   +-- NO --> Continue
    |
    +-- Getting 404 on specific resources?
    |   +-- YES --> Pages unshared or deleted
    |   |   +-- Re-share pages with integration via Connections menu
    |   |   +-- Check if pages were moved to trash
    |   +-- NO --> Continue
    |
    +-- Getting 400 validation errors?
        +-- YES --> Database schema changed in Notion UI
        |   +-- Re-fetch schema (databases.retrieve)
        |   +-- Compare with expected properties
        |   +-- Update property mappings in code
        +-- NO --> Investigate application logs
```
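The decision tree can also be expressed as a small pure function, which is convenient for alert routing or unit tests. This is a minimal sketch: the `Classification` labels and the `classifyIncident` name are hypothetical, and the function takes the status page result and the HTTP status of the failing request as plain inputs rather than calling any API.

```typescript
type Classification =
  | 'notion-side'
  | 'token-expired'
  | 'rate-limited'
  | 'resource-unshared'
  | 'schema-changed'
  | 'investigate-logs';

// Hypothetical helper mirroring the tree: the status page is checked first,
// then the HTTP status of the failing request.
function classifyIncident(statusPageIncident: boolean, httpStatus: number): Classification {
  if (statusPageIncident) return 'notion-side';
  switch (httpStatus) {
    case 401:
      return 'token-expired';
    case 429:
      return 'rate-limited';
    case 404:
      return 'resource-unshared';
    case 400:
      return 'schema-changed';
    default:
      return 'investigate-logs';
  }
}

console.log(classifyIncident(true, 401));  // notion-side
console.log(classifyIncident(false, 429)); // rate-limited
console.log(classifyIncident(false, 500)); // investigate-logs
```

Checking the status page first matches the tree's "DO NOT restart or rotate tokens" rule: a 401 seen during a platform incident is not treated as a token problem.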
Token rotation:
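```bash
# AWS Secrets Manager
aws secretsmanager update-secret \
  --secret-id notion/production \
  --secret-string '{"token":"ntn_NEW_TOKEN_HERE"}'

# GCP Secret Manager
echo -n "ntn_NEW_TOKEN_HERE" | \
  gcloud secrets versions add notion-token-prod --data-file=-

# Restart to pick up new token
kubectl rollout restart deployment/my-app  # Kubernetes

# or, on Cloud Run: roll a new revision that reads the latest secret version.
# Assumes the service consumes the secret as the NOTION_TOKEN env var.
# (--no-traffic would create a revision that receives no requests.)
gcloud run services update my-service \
  --update-secrets=NOTION_TOKEN=notion-token-prod:latest
```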
Cached fallback for Notion outages:
```typescript
import { Client } from '@notionhq/client';

const notion = new Client({ auth: process.env.NOTION_TOKEN! });
const cache = new Map<string, { data: any; timestamp: number }>();
const CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes

async function queryWithFallback(dbId: string, filter?: any) {
  const cacheKey = `query:${dbId}:${JSON.stringify(filter)}`;

  try {
    const result = await notion.databases.query({
      database_id: dbId,
      filter,
      page_size: 100,
    });

    // Update cache on success
    cache.set(cacheKey, { data: result, timestamp: Date.now() });
    return { data: result, source: 'live' as const };
  } catch (error) {
    // Fall back to cache on any API error
    const cached = cache.get(cacheKey);
    if (cached && Date.now() - cached.timestamp < CACHE_TTL_MS) {
      const ageSeconds = Math.round((Date.now() - cached.timestamp) / 1000);
      console.warn(`Notion unavailable, serving cached data (age: ${ageSeconds}s)`);
      return { data: cached.data, source: 'cache' as const };
    }

    // No usable cache entry: re-throw
    throw error;
  }
}

// Schema change detection
async function detectSchemaChanges(dbId: string, expectedProps: string[]) {
  const db = await notion.databases.retrieve({ database_id: dbId });
  const actualProps = Object.keys(db.properties);

  const missing = expectedProps.filter(p => !actualProps.includes(p));
  const unexpected = actualProps.filter(p => !expectedProps.includes(p));

  if (missing.length > 0 || unexpected.length > 0) {
    console.error(JSON.stringify({
      event: 'schema_change_detected',
      database_id: dbId,
      missing_properties: missing,
      new_properties: unexpected,
    }));
  }

  return { missing, unexpected, current: actualProps };
}
```
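Because the fallback above serves cached data on any error, it can also mask 401 or 400 failures that need a fix on the integration side. One possible refinement is to fall back only on errors that look like an outage. The sketch below is an assumption about how that guard might look: `isOutageLike` is a hypothetical name, and it relies only on HTTP errors exposing a numeric `status` property, as the Notion SDK's API response errors do.

```typescript
// Hypothetical guard: decide whether an error looks like an outage (worth
// serving stale data for) rather than a request the integration must fix.
// Errors without a numeric `status` are treated as network failures.
function isOutageLike(error: unknown): boolean {
  if (typeof error === 'object' && error !== null && 'status' in error) {
    const status = (error as { status: unknown }).status;
    if (typeof status === 'number') {
      // 429 and 5xx are transient; other 4xx point at our own request
      return status === 429 || status >= 500;
    }
  }
  return true; // no HTTP status: timeout, DNS failure, connection reset
}

console.log(isOutageLike({ status: 503 }));        // true
console.log(isOutageLike({ status: 401 }));        // false
console.log(isOutageLike(new Error('ETIMEDOUT'))); // true
```

In `queryWithFallback`, the catch block could re-throw immediately when `isOutageLike(error)` is false, so that token and schema problems surface instead of being hidden behind cached results.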
Step 3: Communication and Postmortem
Internal Slack notification template:
```text
:rotating_light: P[1-4] INCIDENT: Notion Integration
Status: [INVESTIGATING | MITIGATING | RESOLVED]
Impact: [specific user-facing impact]
Root Cause: [Notion outage | Token expired | Rate limited | Schema change]
Action: [current remediation step]
ETA: [estimated resolution or "monitoring"]
Dashboard: [link to monitoring dashboard]
Thread: [link to incident channel thread]
```
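Filling the template by hand during an incident is error-prone, so it can help to generate it from structured fields. The following is a minimal sketch; the `IncidentUpdate` type, the `formatIncidentMessage` function, and the example URLs are all hypothetical and only mirror the fields of the template above.

```typescript
interface IncidentUpdate {
  severity: 1 | 2 | 3 | 4;
  status: 'INVESTIGATING' | 'MITIGATING' | 'RESOLVED';
  impact: string;
  rootCause: string;
  action: string;
  eta: string;
  dashboardUrl: string;
  threadUrl: string;
}

// Hypothetical formatter: emits the template fields in the same fixed order.
function formatIncidentMessage(u: IncidentUpdate): string {
  return [
    `:rotating_light: P${u.severity} INCIDENT: Notion Integration`,
    `Status: ${u.status}`,
    `Impact: ${u.impact}`,
    `Root Cause: ${u.rootCause}`,
    `Action: ${u.action}`,
    `ETA: ${u.eta}`,
    `Dashboard: ${u.dashboardUrl}`,
    `Thread: ${u.threadUrl}`,
  ].join('\n');
}

// Example values are placeholders, not real incident data
const message = formatIncidentMessage({
  severity: 2,
  status: 'MITIGATING',
  impact: 'Task sync is serving cached data',
  rootCause: 'Rate limited',
  action: 'Reduced concurrency to 1',
  eta: 'monitoring',
  dashboardUrl: 'https://example.com/dashboard',
  threadUrl: 'https://example.com/thread',
});
console.log(message);
```

The resulting string can be sent as the `text` field of a JSON body posted to a Slack incoming webhook, assuming one is configured as listed in the prerequisites.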
External status page update:
```text
Notion Integration Service Disruption

We are experiencing [brief description of impact].
[Specific feature] may be unavailable or show stale data.

Workaround: [if available, e.g., "Cached data is being served"]
Next update: [time, e.g., "in 30 minutes or sooner if resolved"]

[ISO 8601 timestamp]
```
Postmortem template:
```markdown
## Incident: Notion [Error Type] - [Date]

**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Detection:** [Alert name] / [User report]

### Summary
[1-2 sentence description of what happened and the user impact]

### Timeline (all times UTC)
- HH:MM - First alert fired ([alert name])
- HH:MM - On-call acknowledged, began triage
- HH:MM - Root cause identified: [description]
- HH:MM - Mitigation applied: [action taken]
- HH:MM - Service fully restored

### Root Cause
[Technical explanation, e.g., "Integration token was rotated in Notion dashboard by a team member without updating the secret manager, causing all API calls to return 401 Unauthorized."]

### Impact
- Users affected: N
- Duration of degraded service: X minutes
- Data loss: [none | description]

### Action Items
| Priority | Action | Owner | Due |
|----------|--------|-------|-----|
| P1 | [Preventive measure] | @name | YYYY-MM-DD |
| P2 | [Detection improvement] | @name | YYYY-MM-DD |
| P3 | [Process improvement] | @name | YYYY-MM-DD |
```
Output
- Automated triage script classifying incidents in under 5 minutes
- Decision tree mapping HTTP status codes to root causes
- Per-error-type mitigation procedures with real code
- Cached fallback mode for Notion outages
- Schema change detection for 400 validation errors
- Communication templates for internal and external stakeholders
- Postmortem template with timeline and action items
Error Handling
| Scenario | Triage Signal | Immediate Action |
|---|---|---|
| Notion platform outage | status.notion.so incident | Enable fallback mode, notify users |
| Token expired/revoked | All requests return 401 | Rotate token in secret manager, restart |
| Rate limited | 429 errors spiking | Reduce concurrency to 1, check for loops |
| Schema changed | 400 on specific operations | Run `detectSchemaChanges()`, update mappings |
| Network/DNS issue | Timeouts, no HTTP response | Check firewall, DNS resolution, proxy config |
| Pages unshared | 404 on previously working pages | Re-share via Connections menu in Notion |
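For the rate-limited row, the recommended backoff can be sketched as follows. This is an illustrative helper, not part of the Notion SDK: `backoffDelayMs` and `withBackoff` are hypothetical names, the 500 ms base and 8 s cap are arbitrary assumptions, and jitter is left out so the delays are deterministic.

```typescript
// Hypothetical helper: delay before retry `attempt` (0-based), doubling from
// baseMs and capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8_000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Retry `fn` while `shouldRetry(error)` is true, sleeping between attempts.
// `sleep` is injectable so tests do not have to wait on real timers.
async function withBackoff<T>(
  fn: () => Promise<T>,
  shouldRetry: (error: unknown) => boolean,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up on the last attempt or on errors that retrying cannot fix
      if (attempt + 1 >= maxAttempts || !shouldRetry(error)) throw error;
      await sleep(backoffDelayMs(attempt));
    }
  }
}

console.log([0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n)).join(', '));
// 500, 1000, 2000, 4000, 8000, 8000
```

A caller would wrap a single API call in `withBackoff` and pass a predicate that returns true only for 429 responses. In production, adding random jitter and honoring a `Retry-After` header when the response provides one would be reasonable extensions.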
Examples
One-Line Health Check
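```bash
# jq -e exits non-zero when curl produced no output, so the fallback message
# fires even without "set -o pipefail"
curl -sf https://api.notion.com/v1/users/me \
  -H "Authorization: Bearer ${NOTION_TOKEN}" \
  -H "Notion-Version: 2022-06-28" \
  | jq -e '{name: .name, type: .type}' \
  || echo "UNHEALTHY: Notion API unreachable or auth failed"
```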
Python Quick Triage
```python
import os

from notion_client import Client, APIResponseError


def quick_triage():
    try:
        client = Client(auth=os.environ["NOTION_TOKEN"], timeout_ms=10_000)
        me = client.users.me()
        print(f"OK: Connected as {me['name']}")
    except APIResponseError as e:
        # str(e) carries the API's error message
        print(f"ERROR: {e.code} (HTTP {e.status}): {e}")
    except Exception as e:
        print(f"NETWORK ERROR: {e}")


quick_triage()
```
Resources
- Notion Status Page — real-time platform status
- Notion API Error Codes — full error reference
- Notion Request Limits — 3 req/s average
- Statuspage API — programmatic status checks
Next Steps
For data handling and privacy compliance, see notion-data-handling.