Skillsbench Automatic Speech Recognition (ASR)

Transcribe audio segments to text using Whisper models. Use larger models (small, medium, or large-v3) for better accuracy, or faster-whisper for optimized performance. Always align transcription timestamps with diarization segments for accurate speaker-labeled subtitles.

install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/speaker-diarization-subtitles/environment/skills/automatic-speech-recognition" ~/.claude/skills/benchflow-ai-skillsbench-automatic-speech-recognition-asr && rm -rf "$T"
manifest: tasks/speaker-diarization-subtitles/environment/skills/automatic-speech-recognition/SKILL.md
source content

Automatic Speech Recognition (ASR)

Overview

After speaker diarization, you need to transcribe each speech segment to text. Whisper is the current state-of-the-art for ASR, with multiple model sizes offering different trade-offs between accuracy and speed.

When to Use

  • After speaker diarization is complete
  • Need to generate speaker-labeled transcripts
  • Creating subtitles from audio segments
  • Converting speech segments to text

Whisper Model Selection

Model Size Comparison

| Model    | Parameters | Speed   | Accuracy  | Best For                               |
|----------|------------|---------|-----------|----------------------------------------|
| tiny     | 39M        | Fastest | Lowest    | Quick testing, low accuracy needs      |
| base     | 74M        | Fast    | Low       | Fast processing with moderate accuracy |
| small    | 244M       | Medium  | Good      | Recommended balance                    |
| medium   | 769M       | Slow    | Very Good | High accuracy needs                    |
| large-v3 | 1550M      | Slowest | Best      | Maximum accuracy                       |

Recommended: Use small or large-v3

For best accuracy (recommended for this task):

import whisper

model = whisper.load_model("large-v3")  # Best accuracy
result = model.transcribe(audio_path)

For balanced performance:

import whisper

model = whisper.load_model("small")  # Good balance
result = model.transcribe(audio_path)
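
In both cases, result is a plain dict. A quick sketch of the fields this skill relies on later:

print(result["text"])            # full transcript as a single string
for seg in result["segments"]:   # list of dicts, one per decoded segment
    print(seg["start"], seg["end"], seg["text"])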

Faster-Whisper (Optimized Alternative)

For faster processing with similar accuracy, use faster-whisper:

from faster_whisper import WhisperModel

# Use small model with CPU int8 quantization
model = WhisperModel("small", device="cpu", compute_type="int8")

# Transcribe
segments, info = model.transcribe(audio_path, beam_size=5)

# Process segments
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Advantages:

  • Up to ~4x faster than standard Whisper
  • Lower memory usage, especially with int8 quantization
  • Comparable accuracy to standard Whisper
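
If a GPU is available, faster-whisper can also run in float16. A minimal sketch (the device and compute type below are assumptions about your hardware, not requirements of this skill):

from faster_whisper import WhisperModel

# Assumes an NVIDIA GPU with CUDA available; otherwise use the
# CPU/int8 setup shown above
model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe(audio_path, beam_size=5)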

Aligning Transcriptions with Diarization Segments

After diarization, you need to map Whisper transcriptions to speaker segments:

import whisper

# After diarization, you have turns with speaker labels
turns = [
    {'start': 0.8, 'duration': 0.86, 'speaker': 'SPEAKER_01'},
    {'start': 5.34, 'duration': 0.21, 'speaker': 'SPEAKER_01'},
    # ...
]

# Run Whisper transcription
model = whisper.load_model("large-v3")
result = model.transcribe(audio_path)

# Map transcriptions to turns
transcripts = {}
for i, turn in enumerate(turns):
    turn_start = turn['start']
    turn_end = turn['start'] + turn['duration']
    
    # Find overlapping Whisper segments
    overlapping_text = []
    for seg in result['segments']:
        seg_start = seg['start']
        seg_end = seg['end']
        
        # Check if Whisper segment overlaps with diarization turn
        if seg_start < turn_end and seg_end > turn_start:
            overlapping_text.append(seg['text'].strip())
    
    # Combine overlapping segments
    transcripts[i] = ' '.join(overlapping_text) if overlapping_text else '[INAUDIBLE]'
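
When a single Whisper segment straddles a speaker change, word-level timestamps allow finer alignment. Recent versions of openai-whisper support this via word_timestamps=True (check that your installed version has it); a sketch:

result = model.transcribe(audio_path, word_timestamps=True)

# Each segment now carries a 'words' list with per-word timing,
# so each word can be assigned to the diarization turn it overlaps
for seg in result['segments']:
    for word in seg.get('words', []):
        print(f"[{word['start']:.2f}s -> {word['end']:.2f}s] {word['word']}")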

Handling Empty or Inaudible Segments

# If no transcription found for a segment
if not overlapping_text:
    transcripts[i] = '[INAUDIBLE]'
    
# Or mark very short segments as inaudible (too brief for reliable ASR)
if turn['duration'] < 0.3:
    transcripts[i] = '[INAUDIBLE]'

Language Detection

Whisper can auto-detect language, but you can also specify:

# Auto-detect (recommended)
result = model.transcribe(audio_path)

# Or specify language for better accuracy
result = model.transcribe(audio_path, language="en")
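
With auto-detection, both libraries report what they detected, which is useful for logging or sanity checks:

# openai-whisper: the detected language code is in the result dict
print(result["language"])

# faster-whisper: reported on the info object returned by transcribe()
# (see the faster-whisper example above)
print(info.language, info.language_probability)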

Best Practices

  1. Use larger models for better accuracy: small minimum, large-v3 for best results
  2. Align timestamps carefully: Match Whisper segments with diarization turns
  3. Handle overlaps: Multiple Whisper segments may overlap with one diarization turn
  4. Handle gaps: Some diarization turns may have no corresponding transcription
  5. Post-process text: Clean up punctuation and capitalization if needed (see the sketch after this list)
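
For the post-processing step, a minimal sketch (clean_text is a hypothetical helper, not part of this skill):

def clean_text(text):
    # Hypothetical helper: collapse whitespace, capitalize the first
    # letter, and make sure the line ends with punctuation
    text = ' '.join(text.split())
    if text and text[0].islower():
        text = text[0].upper() + text[1:]
    if text and text[-1] not in '.?!':
        text += '.'
    return text

# Leave the [INAUDIBLE] marker untouched
transcripts = {i: t if t == '[INAUDIBLE]' else clean_text(t)
               for i, t in transcripts.items()}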

Common Issues

  1. Low transcription accuracy: Use a larger model (small → medium → large-v3)
  2. Slow processing: Use faster-whisper or a smaller model
  3. Misaligned timestamps: Check time alignment between diarization and transcription
  4. Missing transcriptions: Check for very short segments or silence

Integration with Subtitle Generation

After transcription, combine with speaker labels for subtitles:

def format_time(seconds):
    # ASS timestamps use H:MM:SS.cc (centisecond precision)
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = seconds % 60
    return f"{h}:{m:02d}:{s:05.2f}"

def generate_subtitles_ass(turns, transcripts, output_path):
    with open(output_path, 'w') as f:
        # ... header code ...

        for i, turn in enumerate(turns):
            start_time = format_time(turn['start'])
            end_time = format_time(turn['start'] + turn['duration'])
            speaker = turn['speaker']
            text = transcripts.get(i, "[INAUDIBLE]")

            # Format: SPEAKER_XX: text
            f.write(f"Dialogue: 0,{start_time},{end_time},Default,,0,0,0,,{speaker}: {text}\n")

Performance Tips

  1. For accuracy: Use the large-v3 model
  2. For speed: Use faster-whisper with the small model
  3. For memory: Use faster-whisper with int8 quantization
  4. Batch processing: Process multiple segments together if possible (see the sketch after this list)
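
For the batch-processing tip, recent faster-whisper releases include a batched pipeline; a sketch (verify that your installed version provides BatchedInferencePipeline):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("small", device="cpu", compute_type="int8")
batched = BatchedInferencePipeline(model=model)

# batch_size trades memory for throughput
segments, info = batched.transcribe(audio_path, batch_size=8)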