Skillsbench Automatic Speech Recognition (ASR)
Transcribe audio segments to text using Whisper models. Use larger models (small, medium, large-v3) for better accuracy, or faster-whisper for optimized performance. Always align transcription timestamps with diarization segments for accurate speaker-labeled subtitles.
```bash
git clone https://github.com/benchflow-ai/skillsbench
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/speaker-diarization-subtitles/environment/skills/automatic-speech-recognition" ~/.claude/skills/benchflow-ai-skillsbench-automatic-speech-recognition-asr && rm -rf "$T"
```
Overview
After speaker diarization, you need to transcribe each speech segment to text. Whisper is the current state-of-the-art for ASR, with multiple model sizes offering different trade-offs between accuracy and speed.
When to Use
- After speaker diarization is complete
- Need to generate speaker-labeled transcripts
- Creating subtitles from audio segments
- Converting speech segments to text
Whisper Model Selection
Model Size Comparison
| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 39M | Fastest | Lowest | Quick testing, low accuracy needs |
| base | 74M | Fast | Low | Fast processing with moderate accuracy |
| small | 244M | Medium | Good | Recommended balance |
| medium | 769M | Slow | Very Good | High accuracy needs |
| large-v3 | 1550M | Slowest | Best | Maximum accuracy |
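The trade-offs in the table can be captured in a small lookup. `pick_whisper_model` below is a hypothetical convenience helper, not part of the whisper API; it simply mirrors the size/speed/accuracy table above:

```python
# Hypothetical helper mapping a priority to a Whisper model name,
# mirroring the size/speed/accuracy table above.
WHISPER_MODELS = {
    "fastest": "tiny",
    "fast": "base",
    "balanced": "small",   # recommended balance
    "accurate": "medium",
    "best": "large-v3",    # maximum accuracy
}

def pick_whisper_model(priority: str = "balanced") -> str:
    """Return a Whisper model name for the given priority, defaulting to small."""
    return WHISPER_MODELS.get(priority, "small")
```

A caller would then do something like `whisper.load_model(pick_whisper_model("best"))`.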
Recommended: Use `small` or `large-v3`.

For best accuracy (recommended for this task):
```python
import whisper

model = whisper.load_model("large-v3")  # Best accuracy
result = model.transcribe(audio_path)
```
For balanced performance:
```python
import whisper

model = whisper.load_model("small")  # Good balance
result = model.transcribe(audio_path)
```
Faster-Whisper (Optimized Alternative)
For faster processing with similar accuracy, use `faster-whisper`:
```python
from faster_whisper import WhisperModel

# Use small model with CPU int8 quantization
model = WhisperModel("small", device="cpu", compute_type="int8")

# Transcribe
segments, info = model.transcribe(audio_path, beam_size=5)

# Process segments
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
Advantages:
- Faster than standard Whisper
- Lower memory usage with quantization
- Similar accuracy to standard Whisper
Aligning Transcriptions with Diarization Segments
After diarization, you need to map Whisper transcriptions to speaker segments:
```python
# After diarization, you have turns with speaker labels
turns = [
    {'start': 0.8, 'duration': 0.86, 'speaker': 'SPEAKER_01'},
    {'start': 5.34, 'duration': 0.21, 'speaker': 'SPEAKER_01'},
    # ...
]

# Run Whisper transcription
model = whisper.load_model("large-v3")
result = model.transcribe(audio_path)

# Map transcriptions to turns
transcripts = {}
for i, turn in enumerate(turns):
    turn_start = turn['start']
    turn_end = turn['start'] + turn['duration']

    # Find overlapping Whisper segments
    overlapping_text = []
    for seg in result['segments']:
        seg_start = seg['start']
        seg_end = seg['end']
        # Check if Whisper segment overlaps with diarization turn
        if seg_start < turn_end and seg_end > turn_start:
            overlapping_text.append(seg['text'].strip())

    # Combine overlapping segments
    transcripts[i] = ' '.join(overlapping_text) if overlapping_text else '[INAUDIBLE]'
```
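The overlap logic can be exercised without running a model by feeding it synthetic segments. The dictionaries below are made-up data shaped like diarization turns and Whisper segments, not real model output:

```python
def map_segments_to_turns(turns, segments):
    """Assign each diarization turn the text of all overlapping ASR segments."""
    transcripts = {}
    for i, turn in enumerate(turns):
        turn_start = turn['start']
        turn_end = turn['start'] + turn['duration']
        overlapping = [
            seg['text'].strip()
            for seg in segments
            # Two intervals overlap iff each starts before the other ends
            if seg['start'] < turn_end and seg['end'] > turn_start
        ]
        transcripts[i] = ' '.join(overlapping) if overlapping else '[INAUDIBLE]'
    return transcripts

# Synthetic data mimicking diarization turns and Whisper segments
turns = [{'start': 0.0, 'duration': 2.0, 'speaker': 'SPEAKER_00'},
         {'start': 5.0, 'duration': 1.0, 'speaker': 'SPEAKER_01'}]
segments = [{'start': 0.5, 'end': 1.5, 'text': ' hello '},
            {'start': 1.4, 'end': 2.5, 'text': 'there'}]
print(map_segments_to_turns(turns, segments))
# → {0: 'hello there', 1: '[INAUDIBLE]'}
```

The second turn gets `[INAUDIBLE]` because no segment overlaps the 5.0–6.0 s window.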
Handling Empty or Inaudible Segments
```python
# If no transcription found for a segment
if not overlapping_text:
    transcripts[i] = '[INAUDIBLE]'

# Or skip very short segments
if turn['duration'] < 0.3:
    transcripts[i] = '[INAUDIBLE]'
```
Language Detection
Whisper can auto-detect language, but you can also specify:
```python
# Auto-detect (recommended)
result = model.transcribe(audio_path)

# Or specify language for better accuracy
result = model.transcribe(audio_path, language="en")
```
Best Practices
- Use larger models for better accuracy: `small` minimum, `large-v3` for best results
- Align timestamps carefully: Match Whisper segments with diarization turns
- Handle overlaps: Multiple Whisper segments may overlap with one diarization turn
- Handle gaps: Some diarization turns may have no corresponding transcription
- Post-process text: Clean up punctuation, capitalization if needed
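The post-processing step can be as simple as whitespace and capitalization cleanup. `clean_text` is an illustrative helper, not a Whisper feature:

```python
import re

def clean_text(text: str) -> str:
    """Collapse whitespace, capitalize the first letter, ensure end punctuation."""
    text = re.sub(r'\s+', ' ', text).strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in '.!?':
        text += '.'
    return text

print(clean_text("  so   we met yesterday"))
# → So we met yesterday.
```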
Common Issues
- Low transcription accuracy: Use larger model (small → medium → large-v3)
- Slow processing: Use faster-whisper or smaller model
- Misaligned timestamps: Check time alignment between diarization and transcription
- Missing transcriptions: Check for very short segments or silence
Integration with Subtitle Generation
After transcription, combine with speaker labels for subtitles:
```python
def generate_subtitles_ass(turns, transcripts, output_path):
    # ... header code ...
    for i, turn in enumerate(turns):
        start_time = format_time(turn['start'])
        end_time = format_time(turn['start'] + turn['duration'])
        speaker = turn['speaker']
        text = transcripts.get(i, "[INAUDIBLE]")
        # Format: SPEAKER_XX: text
        f.write(f"Dialogue: 0,{start_time},{end_time},Default,,0,0,0,,{speaker}: {text}\n")
```
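The `format_time` helper used above is assumed rather than defined in this skill; a minimal sketch for the ASS timestamp format (`H:MM:SS.cc`, centisecond precision) could look like:

```python
def format_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp: H:MM:SS.cc (centiseconds)."""
    centis = int(round(seconds * 100))
    h, rem = divmod(centis, 360000)   # 360000 centiseconds per hour
    m, rem = divmod(rem, 6000)        # 6000 centiseconds per minute
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"

print(format_time(3661.5))
# → 1:01:01.50
```

Note that ASS uses a single hour digit and two-digit centiseconds, unlike SRT's `HH:MM:SS,mmm`.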
Performance Tips
- For accuracy: Use the `large-v3` model
- For speed: Use `faster-whisper` with the `small` model
- For memory: Use `faster-whisper` with `int8` quantization
- Batch processing: Process multiple segments together if possible
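One way to batch is to merge consecutive turns from the same speaker before transcription, so fewer, longer chunks are sent to the model. `merge_adjacent_turns` is a hypothetical helper with an assumed `max_gap` threshold, not part of any diarization library:

```python
def merge_adjacent_turns(turns, max_gap=0.5):
    """Merge consecutive same-speaker turns separated by <= max_gap seconds."""
    merged = []
    for turn in turns:
        if (merged
                and turn['speaker'] == merged[-1]['speaker']
                and turn['start'] - (merged[-1]['start'] + merged[-1]['duration']) <= max_gap):
            # Extend the previous turn to cover this one
            prev = merged[-1]
            prev['duration'] = turn['start'] + turn['duration'] - prev['start']
        else:
            merged.append(dict(turn))
    return merged

turns = [{'start': 0.0, 'duration': 1.0, 'speaker': 'SPEAKER_00'},
         {'start': 1.2, 'duration': 0.8, 'speaker': 'SPEAKER_00'},
         {'start': 4.0, 'duration': 1.0, 'speaker': 'SPEAKER_01'}]
print(merge_adjacent_turns(turns))
# → two turns: one 2.0 s turn for SPEAKER_00, one for SPEAKER_01
```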