Skillshub deepgram-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/deepgram-incident-runbook" ~/.claude/skills/comeonoliver-skillshub-deepgram-incident-runbook && rm -rf "$T"
manifest:
skills/jeremylongshore/claude-code-plugins-plus-skills/deepgram-incident-runbook/SKILL.md
Deepgram Incident Runbook
Overview
Standardized incident response for Deepgram-related production issues. Includes an automated triage script, severity classification (SEV1-SEV4), immediate mitigation actions, fallback activation, and a post-incident review template.
Quick Reference
| Resource | URL |
|---|---|
| Deepgram Status | https://status.deepgram.com |
| Deepgram Console | https://console.deepgram.com |
| Support Email | support@deepgram.com |
| Community | https://github.com/orgs/deepgram/discussions |
Severity Classification
| Level | Definition | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, all transcriptions failing | Immediate | 100% 5xx errors |
| SEV2 | Major degradation, >50% error rate | < 15 min | Specific model failing |
| SEV3 | Minor degradation, elevated latency | < 1 hour | P95 > 30s |
| SEV4 | Single feature affected, cosmetic | < 24 hours | Diarization inaccurate |
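The metric-based rows of the table can be turned into a small classifier so that alerts arrive pre-labelled. The following is a minimal sketch, not part of the original runbook: the `IncidentMetrics` shape and the `classifySeverity` name are assumptions, and only the thresholds (100% errors, >50% errors, P95 > 30s) come from the table. SEV4 is a per-feature judgment rather than a metric threshold, so the function returns `null` when no threshold is crossed.

```typescript
type Severity = 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4';

// Hypothetical input shape; adapt to whatever your monitoring exports.
interface IncidentMetrics {
  errorRate: number;    // fraction of failed requests, 0..1
  p95LatencyMs: number; // 95th percentile latency in milliseconds
}

// Applies the table's thresholds from most to least severe.
function classifySeverity(m: IncidentMetrics): Severity | null {
  if (m.errorRate >= 1) return 'SEV1';       // complete outage
  if (m.errorRate > 0.5) return 'SEV2';      // major degradation
  if (m.p95LatencyMs > 30000) return 'SEV3'; // elevated latency
  return null; // no metric threshold crossed; SEV4 is decided per feature
}
```

Checking the most severe condition first means an incident that matches several rows is always reported at its highest level.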
Instructions
Step 1: Automated Triage (First 5 Minutes)
```bash
#!/bin/bash
# -e is deliberately omitted: a failing check is a finding, not a reason to stop triage.
set -uo pipefail

# Fail fast with a clear message if the API key is missing.
: "${DEEPGRAM_API_KEY:?Set DEEPGRAM_API_KEY before running triage}"

SAMPLE_URL='https://static.deepgram.com/examples/Bueller-Life-moves-702702706.wav'

echo "=== Deepgram Incident Triage ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""

# 1. Check that the Deepgram status page is reachable
echo "--- Status Page ---"
STATUS=$(curl -s --max-time 10 -o /dev/null -w "%{http_code}" https://status.deepgram.com)
echo "Status page: HTTP $STATUS"

# 2. Test API connectivity
echo ""
echo "--- API Connectivity ---"
curl -s --max-time 10 -w "\nHTTP: %{http_code} | Latency: %{time_total}s\n" \
  'https://api.deepgram.com/v1/projects' \
  -H "Authorization: Token $DEEPGRAM_API_KEY" | head -5

# 3. Test transcription
echo ""
echo "--- Transcription Test ---"
RESULT=$(curl -s --max-time 30 -w "\n%{http_code}" \
  -X POST 'https://api.deepgram.com/v1/listen?model=nova-3&smart_format=true' \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"url\":\"$SAMPLE_URL\"}")
HTTP_CODE=$(echo "$RESULT" | tail -1)
echo "Transcription: HTTP $HTTP_CODE"

# 4. Test multiple models
echo ""
echo "--- Model Tests ---"
for MODEL in nova-3 nova-2 base; do
  CODE=$(curl -s --max-time 30 -o /dev/null -w "%{http_code}" \
    -X POST "https://api.deepgram.com/v1/listen?model=$MODEL" \
    -H "Authorization: Token $DEEPGRAM_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"$SAMPLE_URL\"}")
  echo "$MODEL: HTTP $CODE"
done

# 5. Query internal metrics (if available)
echo ""
echo "--- Internal Metrics ---"
curl -s --max-time 5 localhost:3000/health 2>/dev/null || echo "Health endpoint unavailable"
```
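The triage script prints raw HTTP codes, which still have to be read under pressure. As an illustration, the mapping below turns a code into a coarse finding using generic HTTP semantics; it is a sketch, and the `TriageFinding` labels and `interpretHttpCode` name are assumptions rather than anything defined by Deepgram. The `000` case reflects how curl's `%{http_code}` reports a request that received no response at all.

```typescript
type TriageFinding =
  | 'ok'
  | 'unreachable'
  | 'auth'
  | 'rate_limited'
  | 'client_error'
  | 'server_error'
  | 'unexpected';

// Maps the HTTP code printed by the triage script to a coarse finding.
function interpretHttpCode(code: string): TriageFinding {
  const n = Number.parseInt(code, 10);
  if (Number.isNaN(n) || n === 0) return 'unreachable'; // curl prints 000 when no response arrived
  if (n >= 200 && n < 300) return 'ok';
  if (n === 401 || n === 403) return 'auth';             // check the API key and its permissions
  if (n === 429) return 'rate_limited';                  // back off before retrying
  if (n >= 400 && n < 500) return 'client_error';        // likely a problem with the request itself
  if (n >= 500 && n < 600) return 'server_error';        // likely a provider-side problem
  return 'unexpected';
}
```

A run where every check returns `unreachable` points at local network or DNS before it points at Deepgram, while `server_error` on all models is the pattern the SEV1 row describes.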
Step 2: SEV1 Response (Complete Outage)
```typescript
import { createClient } from '@deepgram/sdk';

class DeepgramFallbackService {
  private primaryClient: ReturnType<typeof createClient>;
  private isInFallbackMode = false;
  private failedRequests: Array<{ url: string; options: any; timestamp: Date }> = [];

  constructor(apiKey: string) {
    this.primaryClient = createClient(apiKey);
  }

  async transcribe(url: string, options: any) {
    if (this.isInFallbackMode) {
      return this.handleFallback(url, options);
    }

    try {
      const { result, error } = await this.primaryClient.listen.prerecorded.transcribeUrl(
        { url },
        options
      );
      if (error) throw error;
      return { source: 'deepgram', result };
    } catch (err) {
      // Note: any single failure switches the whole service into fallback mode.
      console.error('Deepgram failed, entering fallback mode');
      this.isInFallbackMode = true;
      return this.handleFallback(url, options);
    }
  }

  private handleFallback(url: string, options: any) {
    // Queue for later replay
    this.failedRequests.push({ url, options, timestamp: new Date() });
    console.warn(`Queued for replay: ${url} (${this.failedRequests.length} in queue)`);
    return {
      source: 'fallback',
      message: 'Transcription queued: Deepgram is currently unavailable',
      queuePosition: this.failedRequests.length,
    };
  }

  // Call when Deepgram recovers. Returns the results of successfully replayed requests.
  async replayQueue() {
    console.log(`Replaying ${this.failedRequests.length} queued requests...`);
    const queue = [...this.failedRequests];
    this.failedRequests = [];
    this.isInFallbackMode = false;
    const replayed: Array<{ url: string; result: unknown }> = [];

    for (const req of queue) {
      // transcribe() never throws: on failure it re-queues the request and
      // returns a fallback outcome, so inspect the outcome instead of catching.
      const outcome = await this.transcribe(req.url, req.options);
      if ('result' in outcome) {
        replayed.push({ url: req.url, result: outcome.result });
        console.log(`Replayed: ${req.url}`);
      } else {
        console.error(`Replay failed, re-queued: ${req.url}`);
      }
    }
    return replayed;
  }
}
```
Step 3: SEV2 Response (Partial Degradation)
```typescript
import { createClient } from '@deepgram/sdk';

async function mitigateSev2() {
  const client = createClient(process.env.DEEPGRAM_API_KEY!);
  const testUrl = 'https://static.deepgram.com/examples/Bueller-Life-moves-702702706.wav';

  console.log('=== SEV2 Mitigation ===');

  // Test each model to find working ones
  const models = ['nova-3', 'nova-2', 'base', 'whisper-large'] as const;
  const working: string[] = [];
  const broken: string[] = [];

  for (const model of models) {
    try {
      const { error } = await client.listen.prerecorded.transcribeUrl(
        { url: testUrl },
        { model }
      );
      if (error) throw error;
      working.push(model);
      console.log(`  [OK] ${model}`);
    } catch {
      broken.push(model);
      console.log(`  [FAIL] ${model}`);
    }
  }

  // Recommended actions
  console.log(`\nWorking models: ${working.join(', ') || 'none'}`);
  console.log(`Broken models: ${broken.join(', ') || 'none'}`);
  if (working.length === 0) {
    console.log('\nAction: All models failing, escalate to SEV1');
  } else if (broken.length > 0) {
    console.log(`\nAction: Switch to ${working[0]} until ${broken.join(', ')} recovers`);
  } else {
    console.log('\nAction: All models healthy, check the feature tests below');
  }

  // Test features against the first working model (nova-3 if none work)
  const features = [
    { name: 'diarize', opts: { diarize: true } },
    { name: 'smart_format', opts: { smart_format: true } },
    { name: 'utterances', opts: { utterances: true } },
  ];

  console.log('\n--- Feature Tests ---');
  for (const { name, opts } of features) {
    try {
      const { error } = await client.listen.prerecorded.transcribeUrl(
        { url: testUrl },
        { model: working[0] ?? 'nova-3', ...opts }
      );
      console.log(`  [${error ? 'FAIL' : 'OK'}] ${name}`);
    } catch {
      console.log(`  [FAIL] ${name}`);
    }
  }
}
```
Step 4: SEV3/SEV4 Mitigation
```typescript
// SEV3: Elevated latency. Increase timeouts and enable aggressive retry.
function configureSev3Mitigation() {
  return {
    timeout: 60000,       // Increase from 30s to 60s
    maxRetries: 5,        // Increase from 3 to 5
    model: 'nova-2',      // Fall back to proven model
    diarize: false,       // Disable to reduce processing
    smart_format: true,   // Keep basic formatting
    utterances: false,    // Disable to reduce processing
    summarize: false,     // Disable
    detect_topics: false, // Disable
  };
}

// SEV4: Single feature broken. Disable it and continue.
function configureSev4Mitigation(brokenFeature: string) {
  const overrides: Record<string, any> = {};
  overrides[brokenFeature] = false;
  console.log(`Disabled ${brokenFeature}; filing Deepgram support ticket`);
  return overrides;
}
```
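Raising both the timeout and the retry count has a cost that is easy to overlook: the worst-case time a single request can hold a caller. The sketch below works it out for a capped exponential backoff. The backoff policy itself (500 ms base, 8 s cap) is an assumption chosen for illustration, since the runbook does not say how retries are spaced; only `timeout: 60000` and `maxRetries: 5` come from the SEV3 configuration.

```typescript
// Delay before each retry: base * 2^attempt, capped at capMs.
function backoffDelaysMs(maxRetries: number, baseMs = 500, capMs = 8000): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(Math.min(capMs, baseMs * 2 ** attempt));
  }
  return delays;
}

// Worst case: the first attempt and every retry each run to the full timeout.
function worstCaseRequestMs(timeoutMs: number, maxRetries: number): number {
  const attempts = maxRetries + 1;
  const totalBackoff = backoffDelaysMs(maxRetries).reduce((sum, d) => sum + d, 0);
  return attempts * timeoutMs + totalBackoff;
}

console.log(backoffDelaysMs(5));           // [ 500, 1000, 2000, 4000, 8000 ]
console.log(worstCaseRequestMs(60000, 5)); // 375500
```

Under these assumptions a request can take over six minutes before it finally fails, so any upstream timeout or queue visibility window needs to be at least that long while the SEV3 settings are active.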
Step 5: Post-Incident Review Template
```markdown
## Deepgram Incident Report

**Date:** YYYY-MM-DD
**Duration:** HH:MM start to HH:MM end (X minutes total)
**Severity:** SEV1/2/3/4
**On-Call:** [Name]

### Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Alert fired: [alert name] |
| HH:MM | On-call acknowledged |
| HH:MM | Triage completed, classified as SEV[N] |
| HH:MM | Mitigation applied: [action taken] |
| HH:MM | Service restored |
| HH:MM | All-clear confirmed |

### Impact
- **Failed requests:** N
- **Affected users:** N
- **Revenue impact:** $X
- **SLA impact:** X minutes of downtime

### Root Cause
[Description of root cause: Deepgram outage, configuration issue, etc.]

### What Went Well
- [Item 1]
- [Item 2]

### What Could Be Improved
- [Item 1]
- [Item 2]

### Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Action 1] | [Name] | YYYY-MM-DD |
| [Action 2] | [Name] | YYYY-MM-DD |
```
Step 6: Escalation Matrix
| Level | Contact | When |
|---|---|---|
| L1 | On-call engineer | Alert fires |
| L2 | Team lead | 15 min without resolution |
| L3 | Deepgram support (support@deepgram.com) | Confirmed Deepgram-side issue |
| L4 | Engineering director | SEV1 > 1 hour |
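The matrix can also be expressed as a function, which makes it easy to wire into a paging bot or to check during a drill. This is a sketch under assumptions: the `IncidentState` shape and the `contactsToEngage` name are hypothetical, and the rules simply restate the "When" column above.

```typescript
type EscalationLevel = 'L1' | 'L2' | 'L3' | 'L4';

// Hypothetical snapshot of an open incident.
interface IncidentState {
  severity: 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4';
  minutesSinceAlert: number;
  resolved: boolean;
  deepgramSideConfirmed: boolean;
}

// Returns every level that should be engaged at this point in the incident.
function contactsToEngage(s: IncidentState): EscalationLevel[] {
  const levels: EscalationLevel[] = ['L1']; // on-call engineer is paged when the alert fires
  if (s.resolved) return levels;
  if (s.minutesSinceAlert >= 15) levels.push('L2'); // 15 min without resolution
  if (s.deepgramSideConfirmed) levels.push('L3');   // confirmed Deepgram-side issue
  if (s.severity === 'SEV1' && s.minutesSinceAlert > 60) levels.push('L4'); // SEV1 > 1 hour
  return levels;
}
```

Returning a list rather than a single level reflects that the rows are independent triggers: a confirmed Deepgram-side issue warrants contacting support even in the first few minutes.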
Output
- Automated triage script (bash, typically runs in under 30 seconds when the API is responsive)
- SEV1 fallback service with request queue and replay
- SEV2 model/feature diagnosis with a recommended fallback model
- SEV3/SEV4 mitigation configurations
- Post-incident review template
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Triage script can't reach Deepgram | Network or DNS | Check outbound HTTPS to api.deepgram.com |
| Fallback queue growing | Extended outage | Alert if queue > 1000, consider alternate STT |
| Replay failures | Audio URLs expired | Re-fetch audio from source before replay |
| Status page shows green but API fails | Partial outage not yet reflected | Report to Deepgram support immediately |
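Two of the rows above, the growing fallback queue and replay failures from expired audio URLs, can be checked with one periodic health function. The following is a minimal sketch: `QueuedRequest`, `checkQueueHealth`, and the 60-minute `urlTtlMinutes` default are assumptions for illustration, and only the 1000-item alert threshold comes from the table. The real URL lifetime depends on where the audio is hosted.

```typescript
// Hypothetical queue entry; mirrors the url/timestamp fields a fallback queue would hold.
interface QueuedRequest {
  url: string;
  timestamp: Date;
}

interface QueueHealth {
  size: number;
  oldestAgeMinutes: number;
  staleCount: number;   // entries older than urlTtlMinutes; re-fetch these before replay
  shouldAlert: boolean; // true when the queue exceeds maxSize
}

function checkQueueHealth(
  queue: QueuedRequest[],
  now: Date,
  maxSize = 1000,     // alert threshold from the table
  urlTtlMinutes = 60  // assumed URL lifetime; adjust to your storage provider
): QueueHealth {
  const ageMinutes = (r: QueuedRequest) =>
    Math.floor((now.getTime() - r.timestamp.getTime()) / 60000);
  const ages = queue.map(ageMinutes);
  return {
    size: queue.length,
    oldestAgeMinutes: ages.length > 0 ? Math.max(...ages) : 0,
    staleCount: ages.filter((age) => age > urlTtlMinutes).length,
    shouldAlert: queue.length > maxSize,
  };
}
```

Running this on a timer during an outage gives early warning on queue growth and tells you how many entries need fresh URLs before a replay is attempted.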