Skillsbench Automatic Speech Recognition (ASR)

Transcribe audio segments to text using Whisper models. Use larger models (small, medium, or large-v3) for better accuracy, or faster-whisper for optimized performance. Always align transcription timestamps with diarization segments for accurate speaker-labeled subtitles.

install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/speaker-diarization-subtitles/environment/skills/automatic-speech-recognition" ~/.claude/skills/benchflow-ai-skillsbench-automatic-speech-recognition-asr && rm -rf "$T"
manifest: tasks/speaker-diarization-subtitles/environment/skills/automatic-speech-recognition/SKILL.md
source content

Automatic Speech Recognition (ASR)

Overview

After speaker diarization, you need to transcribe each speech segment to text. Whisper is the current state-of-the-art for ASR, with multiple model sizes offering different trade-offs between accuracy and speed.

When to Use

  • After speaker diarization is complete
  • Need to generate speaker-labeled transcripts
  • Creating subtitles from audio segments
  • Converting speech segments to text

Whisper Model Selection

Model Size Comparison

| Model    | Parameters | Speed   | Accuracy  | Best For                               |
|----------|------------|---------|-----------|----------------------------------------|
| tiny     | 39M        | Fastest | Lowest    | Quick testing, low accuracy needs      |
| base     | 74M        | Fast    | Low       | Fast processing with moderate accuracy |
| small    | 244M       | Medium  | Good      | Recommended balance                    |
| medium   | 769M       | Slow    | Very Good | High accuracy needs                    |
| large-v3 | 1550M      | Slowest | Best      | Maximum accuracy                       |

Recommended: Use small or large-v3

For best accuracy (recommended for this task):

import whisper

model = whisper.load_model("large-v3")  # Best accuracy
result = model.transcribe(audio_path)

For balanced performance:

import whisper

model = whisper.load_model("small")  # Good balance
result = model.transcribe(audio_path)
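
In both cases, result is a plain dict. A quick sketch of the fields this skill relies on later:

print(result["text"])            # full transcript as a single string
for seg in result["segments"]:   # list of dicts, one per decoded segment
    print(seg["start"], seg["end"], seg["text"])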

Faster-Whisper (Optimized Alternative)

For faster processing with similar accuracy, use faster-whisper:

from faster_whisper import WhisperModel

# Use small model with CPU int8 quantization
model = WhisperModel("small", device="cpu", compute_type="int8")

# Transcribe
segments, info = model.transcribe(audio_path, beam_size=5)

# Process segments
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Advantages:

  • Up to ~4x faster than standard Whisper
  • Lower memory usage, especially with int8 quantization
  • Comparable accuracy to standard Whisper
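
If a GPU is available, faster-whisper can also run in float16. A minimal sketch (the device and compute type below are assumptions about your hardware, not requirements of this skill):

from faster_whisper import WhisperModel

# Assumes an NVIDIA GPU with CUDA available; otherwise use the
# CPU/int8 setup shown above
model = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = model.transcribe(audio_path, beam_size=5)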

Aligning Transcriptions with Diarization Segments

After diarization, you need to map Whisper transcriptions to speaker segments:

import whisper

# After diarization, you have turns with speaker labels
turns = [
    {'start': 0.8, 'duration': 0.86, 'speaker': 'SPEAKER_01'},
    {'start': 5.34, 'duration': 0.21, 'speaker': 'SPEAKER_01'},
    # ...
]

# Run Whisper transcription
model = whisper.load_model("large-v3")
result = model.transcribe(audio_path)

# Map transcriptions to turns
transcripts = {}
for i, turn in enumerate(turns):
    turn_start = turn['start']
    turn_end = turn['start'] + turn['duration']
    
    # Find overlapping Whisper segments
    overlapping_text = []
    for seg in result['segments']:
        seg_start = seg['start']
        seg_end = seg['end']
        
        # Check if Whisper segment overlaps with diarization turn
        if seg_start < turn_end and seg_end > turn_start:
            overlapping_text.append(seg['text'].strip())
    
    # Combine overlapping segments
    transcripts[i] = ' '.join(overlapping_text) if overlapping_text else '[INAUDIBLE]'
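
When a single Whisper segment straddles a speaker change, word-level timestamps allow finer alignment. Recent versions of openai-whisper support this via word_timestamps=True (check that your installed version has it); a sketch:

result = model.transcribe(audio_path, word_timestamps=True)

# Each segment now carries a 'words' list with per-word timing,
# so each word can be assigned to the diarization turn it overlaps
for seg in result['segments']:
    for word in seg.get('words', []):
        print(f"[{word['start']:.2f}s -> {word['end']:.2f}s] {word['word']}")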

Handling Empty or Inaudible Segments

# If no transcription found for a segment
if not overlapping_text:
    transcripts[i] = '[INAUDIBLE]'
    
# Or mark very short segments as inaudible (too brief for reliable ASR)
if turn['duration'] < 0.3:
    transcripts[i] = '[INAUDIBLE]'

Language Detection

Whisper can auto-detect language, but you can also specify:

# Auto-detect (recommended)
result = model.transcribe(audio_path)

# Or specify language for better accuracy
result = model.transcribe(audio_path, language="en")
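
With auto-detection, both libraries report what they detected, which is useful for logging or sanity checks:

# openai-whisper: the detected language code is in the result dict
print(result["language"])

# faster-whisper: reported on the info object returned by transcribe()
# (see the faster-whisper example above)
print(info.language, info.language_probability)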

Best Practices

  1. Use larger models for better accuracy: small minimum, large-v3 for best results
  2. Align timestamps carefully: Match Whisper segments with diarization turns
  3. Handle overlaps: Multiple Whisper segments may overlap with one diarization turn
  4. Handle gaps: Some diarization turns may have no corresponding transcription
  5. Post-process text: Clean up punctuation and capitalization if needed (see the sketch after this list)
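
For the post-processing step, a minimal sketch (clean_text is a hypothetical helper, not part of this skill):

def clean_text(text):
    # Hypothetical helper: collapse whitespace, capitalize the first
    # letter, and make sure the line ends with punctuation
    text = ' '.join(text.split())
    if text and text[0].islower():
        text = text[0].upper() + text[1:]
    if text and text[-1] not in '.?!':
        text += '.'
    return text

# Leave the [INAUDIBLE] marker untouched
transcripts = {i: t if t == '[INAUDIBLE]' else clean_text(t)
               for i, t in transcripts.items()}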

Common Issues

  1. Low transcription accuracy: Use a larger model (small → medium → large-v3)
  2. Slow processing: Use faster-whisper or a smaller model
  3. Misaligned timestamps: Check time alignment between diarization and transcription
  4. Missing transcriptions: Check for very short segments or silence

Integration with Subtitle Generation

After transcription, combine with speaker labels for subtitles:

def format_time(seconds):
    # ASS timestamps use H:MM:SS.cc (centisecond precision)
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = seconds % 60
    return f"{h}:{m:02d}:{s:05.2f}"

def generate_subtitles_ass(turns, transcripts, output_path):
    with open(output_path, 'w') as f:
        # ... header code ...

        for i, turn in enumerate(turns):
            start_time = format_time(turn['start'])
            end_time = format_time(turn['start'] + turn['duration'])
            speaker = turn['speaker']
            text = transcripts.get(i, "[INAUDIBLE]")

            # Format: SPEAKER_XX: text
            f.write(f"Dialogue: 0,{start_time},{end_time},Default,,0,0,0,,{speaker}: {text}\n")

Performance Tips

  1. For accuracy: Use the large-v3 model
  2. For speed: Use faster-whisper with the small model
  3. For memory: Use faster-whisper with int8 quantization
  4. Batch processing: Process multiple segments together if possible (see the sketch after this list)
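
For the batch-processing tip, recent faster-whisper releases include a batched pipeline; a sketch (verify that your installed version provides BatchedInferencePipeline):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("small", device="cpu", compute_type="int8")
batched = BatchedInferencePipeline(model=model)

# batch_size trades memory for throughput
segments, info = batched.transcribe(audio_path, batch_size=8)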