manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/deepgram-incident-runbook/SKILL.md

Deepgram Incident Runbook

Overview

Standardized incident response for Deepgram-related production issues. It includes an automated triage script, severity classification (SEV1-SEV4), immediate mitigation actions, fallback activation, an escalation matrix, and a post-incident review template.

Quick Reference

| Resource | URL |
|----------|-----|
| Deepgram Status | https://status.deepgram.com |
| Deepgram Console | https://console.deepgram.com |
| Support Email | support@deepgram.com |
| Community | https://github.com/orgs/deepgram/discussions |

Severity Classification

| Level | Definition | Response Time | Example |
|-------|------------|---------------|---------|
| SEV1 | Complete outage, all transcriptions failing | Immediate | 100% 5xx errors |
| SEV2 | Major degradation, >50% error rate | < 15 min | Specific model failing |
| SEV3 | Minor degradation, elevated latency | < 1 hour | P95 > 30s |
| SEV4 | Single feature affected, cosmetic | < 24 hours | Diarization inaccurate |
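
The thresholds in the table above can be turned into a small classifier so that alerts arrive already labelled. This is a minimal sketch: the `IncidentMetrics` shape and its field names are assumptions made for illustration and do not come from the Deepgram SDK or from this runbook's tooling.

```typescript
// Hypothetical metrics shape; adapt the fields to whatever your monitoring exposes.
interface IncidentMetrics {
  errorRate: number;              // fraction of failed requests, 0..1
  p95LatencySeconds: number;      // P95 transcription latency in seconds
  singleFeatureAffected: boolean; // e.g. only diarization quality is degraded
}

type Severity = 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4' | 'NONE';

// Checks the most severe condition first, using the thresholds from the table.
function classifySeverity(m: IncidentMetrics): Severity {
  if (m.errorRate >= 1) return 'SEV1';         // complete outage
  if (m.errorRate > 0.5) return 'SEV2';        // major degradation, >50% errors
  if (m.p95LatencySeconds > 30) return 'SEV3'; // elevated latency, P95 > 30s
  if (m.singleFeatureAffected) return 'SEV4';  // single feature affected
  return 'NONE';
}

console.log(classifySeverity({ errorRate: 0.62, p95LatencySeconds: 4, singleFeatureAffected: false }));
```

Real incidents rarely fit a single threshold cleanly, so treat the result as a starting classification that the on-call engineer can override.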

Instructions

Step 1: Automated Triage (First 5 Minutes)

#!/bin/bash
# No `set -e`: triage must keep running when an individual check fails.
set -uo pipefail

# Abort early with a clear message if the API key is not set.
: "${DEEPGRAM_API_KEY:?DEEPGRAM_API_KEY must be set}"

SAMPLE_URL='https://static.deepgram.com/examples/Bueller-Life-moves-pretty-fast.wav'
MAX_TIME=10  # seconds per lightweight request, so a hung connection cannot stall triage

echo "=== Deepgram Incident Triage ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""

# 1. Check that the Deepgram status page is reachable.
#    HTTP 200 only means the page loaded; open it to read the reported status.
echo "--- Status Page ---"
STATUS=$(curl -s --max-time "$MAX_TIME" -o /dev/null -w "%{http_code}" https://status.deepgram.com)
echo "Status page: HTTP $STATUS"

# 2. Test API connectivity and authentication
echo ""
echo "--- API Connectivity ---"
curl -s --max-time "$MAX_TIME" -w "\nHTTP: %{http_code} | Latency: %{time_total}s\n" \
  'https://api.deepgram.com/v1/projects' \
  -H "Authorization: Token $DEEPGRAM_API_KEY" | head -5

# 3. Test transcription (curl reports HTTP 000 if no response arrives)
echo ""
echo "--- Transcription Test ---"
RESULT=$(curl -s --max-time 30 -w "\n%{http_code}" \
  -X POST 'https://api.deepgram.com/v1/listen?model=nova-3&smart_format=true' \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"url\":\"$SAMPLE_URL\"}")
HTTP_CODE=$(echo "$RESULT" | tail -1)
echo "Transcription: HTTP $HTTP_CODE"

# 4. Test multiple models to spot a model-specific failure
echo ""
echo "--- Model Tests ---"
for MODEL in nova-3 nova-2 base; do
  CODE=$(curl -s --max-time 30 -o /dev/null -w "%{http_code}" \
    -X POST "https://api.deepgram.com/v1/listen?model=$MODEL" \
    -H "Authorization: Token $DEEPGRAM_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"$SAMPLE_URL\"}")
  echo "$MODEL: HTTP $CODE"
done

# 5. Query your own service's health endpoint (adjust host, port, and path)
echo ""
echo "--- Internal Metrics ---"
curl -s --max-time "$MAX_TIME" localhost:3000/health 2>/dev/null || echo "Health endpoint unavailable"
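
The HTTP codes printed by the triage script are easier to act on with a first-pass interpretation next to them. The sketch below relies only on generic HTTP semantics, not on Deepgram's documented error catalogue, and the function name `interpretTriageCode` is hypothetical. curl prints `000` when it receives no response at all, which parses to `0`.

```typescript
// Maps an HTTP status code from the triage output to a likely next step.
// Generic HTTP semantics only; confirm details against the response body.
function interpretTriageCode(code: number): string {
  if (code === 0) return 'No response: check network, DNS, and outbound HTTPS';
  if (code >= 200 && code < 300) return 'OK';
  if (code === 401 || code === 403) return 'Auth problem: check the API key and its permissions';
  if (code === 429) return 'Rate limited: reduce request volume or concurrency';
  if (code >= 500) return 'Server error: likely Deepgram-side, check the status page';
  return `Unexpected HTTP ${code}: inspect the response body`;
}

for (const raw of ['000', '200', '401', '503']) {
  console.log(`${raw}: ${interpretTriageCode(Number(raw))}`);
}
```

A 401 or 403 during an apparent outage usually points at a local configuration change such as a rotated or revoked key, which is worth ruling out before escalating.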

Step 2: SEV1 Response (Complete Outage)

import { createClient } from '@deepgram/sdk';

class DeepgramFallbackService {
  private primaryClient: ReturnType<typeof createClient>;
  private isInFallbackMode = false;
  private failedRequests: Array<{ url: string; options: any; timestamp: Date }> = [];

  constructor(apiKey: string) {
    this.primaryClient = createClient(apiKey);
  }

  async transcribe(url: string, options: any) {
    if (this.isInFallbackMode) {
      return this.handleFallback(url, options);
    }

    try {
      const { result, error } = await this.primaryClient.listen.prerecorded.transcribeUrl(
        { url }, options
      );
      if (error) throw error;
      return { source: 'deepgram', result };
    } catch (err) {
      console.error('Deepgram failed, entering fallback mode:', err);
      this.isInFallbackMode = true;
      return this.handleFallback(url, options);
    }
  }

  private handleFallback(url: string, options: any) {
    // Queue for later replay
    this.failedRequests.push({ url, options, timestamp: new Date() });
    console.warn(`Queued for replay: ${url} (${this.failedRequests.length} in queue)`);

    return {
      source: 'fallback',
      message: 'Transcription queued — Deepgram is currently unavailable',
      queuePosition: this.failedRequests.length,
    };
  }

  // Call when Deepgram recovers
  async replayQueue() {
    console.log(`Replaying ${this.failedRequests.length} queued requests...`);
    const queue = [...this.failedRequests];
    this.failedRequests = [];
    this.isInFallbackMode = false;

    for (const req of queue) {
      // transcribe() never throws: on failure it re-enters fallback mode and
      // re-queues the request itself, so inspect the returned source instead.
      const outcome = await this.transcribe(req.url, req.options);
      if (outcome.source === 'deepgram') {
        console.log(`Replayed: ${req.url}`);
      } else {
        console.error(`Replay failed, re-queued: ${req.url}`);
      }
    }
  }
}
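
The class above leaves open how to decide that Deepgram has recovered before calling `replayQueue()`. One option is a background probe that waits for several consecutive healthy checks. This is a sketch under assumptions: `startRecoveryProbe`, its parameters, and the default interval and success count are illustrative and not part of the Deepgram SDK.

```typescript
// Polls `isHealthy` every `intervalMs` and calls `onRecovered` once after
// `requiredSuccesses` consecutive healthy probes. Returns a stop function.
function startRecoveryProbe(
  isHealthy: () => Promise<boolean>,
  onRecovered: () => Promise<void>,
  intervalMs = 30_000,
  requiredSuccesses = 3,
): () => void {
  let consecutive = 0;
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const schedule = () => {
    timer = setTimeout(() => { void tick(); }, intervalMs);
  };

  const tick = async () => {
    if (stopped) return;
    let healthy = false;
    try {
      healthy = await isHealthy();
    } catch {
      healthy = false; // a throwing probe counts as unhealthy
    }
    if (stopped) return; // stop() was called while the probe was in flight
    consecutive = healthy ? consecutive + 1 : 0;
    if (consecutive >= requiredSuccesses) {
      stopped = true;
      try {
        await onRecovered();
      } catch (err) {
        console.error('Recovery callback failed:', err);
      }
      return;
    }
    schedule();
  };

  schedule();
  return () => {
    stopped = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

In use, `isHealthy` could run a single test transcription and `onRecovered` could call `replayQueue()` on the fallback service. Requiring several consecutive successes avoids replaying the whole queue into a service that is still flapping.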

Step 3: SEV2 Response (Partial Degradation)

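import { createClient } from '@deepgram/sdk';

// Probes each model and a few features to find what still works during a partial outage.
async function mitigateSev2() {
  const client = createClient(process.env.DEEPGRAM_API_KEY!);
  const testUrl = 'https://static.deepgram.com/examples/Bueller-Life-moves-pretty-fast.wav';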

  console.log('=== SEV2 Mitigation ===');

  // Test each model to find working ones
  const models = ['nova-3', 'nova-2', 'base', 'whisper-large'] as const;
  const working: string[] = [];
  const broken: string[] = [];

  for (const model of models) {
    try {
      const { error } = await client.listen.prerecorded.transcribeUrl(
        { url: testUrl }, { model }
      );
      if (error) throw error;
      working.push(model);
      console.log(`  [OK] ${model}`);
    } catch {
      broken.push(model);
      console.log(`  [FAIL] ${model}`);
    }
  }

  // Recommended actions
  console.log(`\nWorking models: ${working.join(', ')}`);
  console.log(`Broken models: ${broken.join(', ')}`);

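  if (working.length === 0) {
    console.log('\nAction: All models failing, escalate to SEV1');
  } else if (broken.length > 0) {
    console.log(`\nAction: Switch to ${working[0]} until ${broken.join(', ')} recovers`);
  } else {
    console.log('\nAction: All models healthy, check features and your own integration');
  }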

  // Test features
  const features = [
    { name: 'diarize', opts: { diarize: true } },
    { name: 'smart_format', opts: { smart_format: true } },
    { name: 'utterances', opts: { utterances: true } },
  ];

  console.log('\n--- Feature Tests ---');
  for (const { name, opts } of features) {
    try {
      const { error } = await client.listen.prerecorded.transcribeUrl(
        { url: testUrl }, { model: working[0] ?? 'nova-3', ...opts }
      );
      console.log(`  [${error ? 'FAIL' : 'OK'}] ${name}`);
    } catch {
      console.log(`  [FAIL] ${name}`);
    }
  }
}
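
The diagnosis above only prints a recommendation. To apply the same idea per request, a caller can try models in preference order and use the first one that succeeds. The sketch below is an assumption-laden illustration: `transcribeWithModelFallback` is a hypothetical helper, and the `transcribe` callback is injected so the example stays independent of the SDK.

```typescript
// Tries each model in order and returns the first successful result,
// together with the model that produced it.
async function transcribeWithModelFallback<T>(
  models: readonly string[],
  transcribe: (model: string) => Promise<T>,
): Promise<{ model: string; result: T }> {
  const errors: string[] = [];
  for (const model of models) {
    try {
      return { model, result: await transcribe(model) };
    } catch (err) {
      // Record the failure and move on to the next model in the list.
      errors.push(`${model}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All models failed: ${errors.join('; ')}`);
}
```

With the SDK client from the step above, the callback could wrap `client.listen.prerecorded.transcribeUrl({ url }, { model })` and throw when the returned `error` is set. Trying models sequentially adds latency on failure, so during a known outage it is cheaper to put the working model first.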

Step 4: SEV3/SEV4 Mitigation

// SEV3: Elevated latency — increase timeouts and enable aggressive retry
function configureSev3Mitigation() {
  return {
    timeout: 60000,          // Increase from 30s to 60s
    maxRetries: 5,           // Increase from 3 to 5
    model: 'nova-2',         // Fallback to proven model
    diarize: false,          // Disable to reduce processing
    smart_format: true,      // Keep basic formatting
    utterances: false,       // Disable to reduce processing
    summarize: false,        // Disable
    detect_topics: false,    // Disable
  };
}

// SEV4: Single feature broken — disable and continue
function configureSev4Mitigation(brokenFeature: string) {
  const overrides: Record<string, any> = {};
  overrides[brokenFeature] = false;
  console.log(`Disabled ${brokenFeature} — filing Deepgram support ticket`);
  return overrides;
}
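
The SEV3 configuration sets `timeout` and `maxRetries`, but nothing above shows how a caller would honor them. A minimal sketch of a retry wrapper with a per-attempt timeout and exponential backoff follows. The names `withRetry`, `RetryConfig`, and `baseDelayMs` are assumptions for illustration, and the timeout only stops waiting: it does not cancel the underlying request.

```typescript
interface RetryConfig {
  maxRetries: number;   // retries after the first attempt
  timeout: number;      // per-attempt timeout in milliseconds
  baseDelayMs?: number; // first backoff delay; doubles after each failure
}

// Runs `fn` up to maxRetries + 1 times, waiting baseDelayMs * 2^attempt
// between attempts, and rethrows the last error if every attempt fails.
async function withRetry<T>(fn: () => Promise<T>, cfg: RetryConfig): Promise<T> {
  const baseDelay = cfg.baseDelayMs ?? 500;
  let lastError: unknown;
  for (let attempt = 0; attempt <= cfg.maxRetries; attempt++) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      const timeoutPromise = new Promise<never>((_, reject) => {
        timer = setTimeout(
          () => reject(new Error(`Timed out after ${cfg.timeout} ms`)),
          cfg.timeout,
        );
      });
      // Whichever settles first wins; a timed-out call keeps running in the background.
      return await Promise.race([fn(), timeoutPromise]);
    } catch (err) {
      lastError = err;
      if (attempt < cfg.maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, baseDelay * 2 ** attempt));
      }
    } finally {
      if (timer !== undefined) clearTimeout(timer);
    }
  }
  throw lastError;
}
```

The object returned by `configureSev3Mitigation()` already carries `timeout` and `maxRetries`, so it could be passed as the second argument. Aggressive retries add load to a service that is already slow, so it is worth capping total attempts across the fleet.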

Step 5: Post-Incident Review Template

## Deepgram Incident Report

**Date:** YYYY-MM-DD
**Duration:** HH:MM start — HH:MM end (X minutes total)
**Severity:** SEV1/2/3/4
**On-Call:** [Name]

### Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Alert fired: [alert name] |
| HH:MM | On-call acknowledged |
| HH:MM | Triage completed, classified as SEV[N] |
| HH:MM | Mitigation applied: [action taken] |
| HH:MM | Service restored |
| HH:MM | All-clear confirmed |

### Impact
- **Failed requests:** N
- **Affected users:** N
- **Revenue impact:** $X
- **SLA impact:** X minutes of downtime

### Root Cause
[Description of root cause — Deepgram outage / configuration issue / etc.]

### What Went Well
- [Item 1]
- [Item 2]

### What Could Be Improved
- [Item 1]
- [Item 2]

### Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Action 1] | [Name] | YYYY-MM-DD |
| [Action 2] | [Name] | YYYY-MM-DD |

Step 6: Escalation Matrix

| Level | Contact | When |
|-------|---------|------|
| L1 | On-call engineer | Alert fires |
| L2 | Team lead | 15 min without resolution |
| L3 | Deepgram support (support@deepgram.com) | Confirmed Deepgram-side issue |
| L4 | Engineering director | SEV1 > 1 hour |
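
The matrix above can also be encoded so that paging tooling applies it consistently. This is a sketch only: the `IncidentState` shape and `contactsToNotify` are hypothetical names, and the rules are a direct reading of the four rows.

```typescript
// Hypothetical incident snapshot; field names are illustrative.
interface IncidentState {
  severity: 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4';
  minutesSinceAlert: number;
  resolved: boolean;
  deepgramSideConfirmed: boolean; // triage points at Deepgram, not at your own stack
}

// Returns every contact that should have been notified by now.
function contactsToNotify(s: IncidentState): string[] {
  if (s.resolved) return [];
  const contacts = ['L1: On-call engineer'];                     // alert fired
  if (s.minutesSinceAlert >= 15) contacts.push('L2: Team lead'); // 15 min without resolution
  if (s.deepgramSideConfirmed) {
    contacts.push('L3: Deepgram support (support@deepgram.com)');
  }
  if (s.severity === 'SEV1' && s.minutesSinceAlert > 60) {
    contacts.push('L4: Engineering director');                   // SEV1 > 1 hour
  }
  return contacts;
}

console.log(contactsToNotify({
  severity: 'SEV1', minutesSinceAlert: 75, resolved: false, deepgramSideConfirmed: true,
}));
```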

Output

Automated triage script (bash, normally finishes in under 30 seconds)
SEV1 fallback service with request queue and replay
SEV2 model/feature diagnosis with a recommended fallback model
SEV3/SEV4 mitigation configurations
Post-incident review template
Escalation matrix

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| Triage script can't reach Deepgram | Network or DNS | Check outbound HTTPS to api.deepgram.com |
| Fallback queue growing | Extended outage | Alert if queue > 1000, consider alternate STT |
| Replay failures | Audio URLs expired | Re-fetch audio from source before replay |
| Status page shows green but API fails | Partial outage not yet reflected | Report to Deepgram support immediately |