Installation
Clone the full skills repository:
git clone https://github.com/TerminalSkills/skills
Or copy just this skill into ~/.claude/skills:
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/whisper" ~/.claude/skills/terminalskills-skills-whisper && rm -rf "$T"
Whisper
Overview
Transcribe audio with OpenAI's Whisper — the state-of-the-art speech recognition model. This skill covers local Whisper (Python), faster-whisper (CTranslate2, 4x faster), whisper.cpp (CPU-optimized C++), and the OpenAI Whisper API. Includes subtitle generation (SRT/VTT/JSON), multi-language transcription, translation to English, speaker diarization, word-level timestamps, and production pipeline patterns for podcasts, meetings, and video subtitles.
Instructions
Step 1: Choose Your Runtime
Option A — OpenAI Whisper (original Python):
pip install openai-whisper # Models: tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1.5G)
Option B — faster-whisper (recommended for local, 4x faster):
pip install faster-whisper # Uses CTranslate2 — INT8 quantization, runs well on CPU
Option C — whisper.cpp (best for CPU, minimal dependencies):
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
# Download model
bash models/download-ggml-model.sh base.en
Option D — OpenAI API (no local GPU needed):
pip install openai
export OPENAI_API_KEY="sk-..."
Step 2: Basic Transcription
faster-whisper (recommended):
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
# GPU: model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("episode.mp3", beam_size=5)
print(f"Language: {info.language} (prob: {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
OpenAI Whisper (original):
import whisper

model = whisper.load_model("base")  # tiny, base, small, medium, large-v3
result = model.transcribe("episode.mp3")
print(result["text"])

for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
whisper.cpp (CLI):
./main -m models/ggml-base.en.bin -f episode.wav -otxt -osrt -ovtt
# Outputs: episode.txt, episode.srt, episode.vtt
OpenAI API:
from openai import OpenAI

client = OpenAI()
with open("episode.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["segment", "word"],
    )

print(transcript.text)
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
Step 3: Subtitle Generation (SRT/VTT)
Generate SRT with faster-whisper:
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("episode.mp3", beam_size=5)
segments = list(segments)  # materialize the generator so it can be reused for VTT below

def format_timestamp(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("episode.srt", "w") as f:
    for i, seg in enumerate(segments, 1):
        f.write(f"{i}\n")
        f.write(f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n")
        f.write(f"{seg.text.strip()}\n\n")

print("Generated episode.srt")
Generate VTT (for HTML5 video):
with open("episode.vtt", "w") as f: f.write("WEBVTT\n\n") for i, seg in enumerate(segments, 1): start = format_timestamp(seg.start).replace(",", ".") end = format_timestamp(seg.end).replace(",", ".") f.write(f"{start} --> {end}\n") f.write(f"{seg.text.strip()}\n\n")
Word-level timestamps (for karaoke-style subtitles):
segments, info = model.transcribe("episode.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"  [{word.start:.2f}s → {word.end:.2f}s] {word.word}")
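The overview also mentions JSON output. faster-whisper has no built-in JSON writer, so here is a minimal sketch that serializes segments and word timings yourself; the field names are illustrative, not a standard schema:
# Illustrative schema: language plus per-segment text and word-level timings.
import json

segments, info = model.transcribe("episode.mp3", word_timestamps=True)
data = {
    "language": info.language,
    "segments": [
        {
            "start": seg.start,
            "end": seg.end,
            "text": seg.text.strip(),
            "words": [
                {"start": w.start, "end": w.end, "word": w.word}
                for w in (seg.words or [])
            ],
        }
        for seg in segments
    ],
}

with open("episode.json", "w") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)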
Step 4: Language Detection & Translation
# Auto-detect language
segments, info = model.transcribe("foreign_audio.mp3")
print(f"Detected: {info.language} ({info.language_probability:.0%})")

# Force specific language
segments, info = model.transcribe("german.mp3", language="de")

# Translate to English (any language → English)
segments, info = model.transcribe("german.mp3", task="translate")
for seg in segments:
    print(seg.text)  # English translation
Supported languages: 99 languages including en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, nl, ar, sv, it, hi, and many more.
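To enumerate the codes programmatically, the original openai-whisper package exposes them as a dictionary; a small sketch, assuming openai-whisper is installed:
# LANGUAGES maps each language code to its English name (e.g. "de" -> "german").
from whisper.tokenizer import LANGUAGES

for code, name in sorted(LANGUAGES.items()):
    print(f"{code}: {name}")
print(f"Total: {len(LANGUAGES)} languages")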
Step 5: Speaker Diarization
Combine Whisper with pyannote-audio for "who said what":
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline
import torch

# Transcribe
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("meeting.wav", beam_size=5)
whisper_segments = list(segments)

# Diarize (requires HuggingFace token with pyannote access)
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_YOUR_TOKEN",
)
diarization_result = diarization("meeting.wav")

# Merge: assign speaker to each whisper segment
def get_speaker(start, end, diarization_result):
    """Find the dominant speaker during a time range."""
    speakers = {}
    for turn, _, speaker in diarization_result.itertracks(yield_label=True):
        overlap_start = max(start, turn.start)
        overlap_end = min(end, turn.end)
        if overlap_start < overlap_end:
            duration = overlap_end - overlap_start
            speakers[speaker] = speakers.get(speaker, 0) + duration
    return max(speakers, key=speakers.get) if speakers else "Unknown"

for seg in whisper_segments:
    speaker = get_speaker(seg.start, seg.end, diarization_result)
    print(f"[{speaker}] [{seg.start:.1f}s → {seg.end:.1f}s] {seg.text}")
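pyannote returns raw labels such as SPEAKER_00 and SPEAKER_01; if you want the friendlier labels used in Example 2 below, a short follow-up sketch that reuses whisper_segments, get_speaker, and diarization_result from above (the output filename is illustrative):
# Rename diarization labels to "Speaker 1", "Speaker 2", ... in order of first
# appearance and write a labeled transcript.
label_map = {}
lines = []
for seg in whisper_segments:
    raw = get_speaker(seg.start, seg.end, diarization_result)
    label = label_map.setdefault(raw, f"Speaker {len(label_map) + 1}")
    lines.append(f"[{label}] {seg.text.strip()}")

with open("meeting_transcript.txt", "w") as f:
    f.write("\n".join(lines) + "\n")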
Step 6: Batch Processing & Pipelines
Transcribe all episodes in a directory:
from pathlib import Path
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
episodes_dir = Path("episodes")
output_dir = Path("transcripts")
output_dir.mkdir(exist_ok=True)

for audio_file in sorted(episodes_dir.glob("*.mp3")):
    print(f"Transcribing: {audio_file.name}")
    segments, info = model.transcribe(str(audio_file), beam_size=5)
    segments = list(segments)  # materialize once so both outputs reuse the same pass

    # Plain text
    txt_path = output_dir / f"{audio_file.stem}.txt"
    with open(txt_path, "w") as f:
        for seg in segments:
            f.write(seg.text.strip() + "\n")

    # SRT
    srt_path = output_dir / f"{audio_file.stem}.srt"
    with open(srt_path, "w") as f:
        for i, seg in enumerate(segments, 1):
            h, m = divmod(int(seg.start), 3600)
            m, s = divmod(m, 60)
            ms = int((seg.start % 1) * 1000)
            start_ts = f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
            h2, m2 = divmod(int(seg.end), 3600)
            m2, s2 = divmod(m2, 60)
            ms2 = int((seg.end % 1) * 1000)
            end_ts = f"{h2:02d}:{m2:02d}:{s2:02d},{ms2:03d}"
            f.write(f"{i}\n{start_ts} --> {end_ts}\n{seg.text.strip()}\n\n")

    print(f"  → {txt_path}, {srt_path}")
Step 7: Model Selection Guide
| Model | Size | VRAM | Speed (CPU) | Accuracy | Best For |
|---|---|---|---|---|---|
| tiny | 39M | ~1GB | Very fast | Low | Quick drafts, real-time |
| base | 74M | ~1GB | Fast | Medium | Good balance for CPU |
| small | 244M | ~2GB | Moderate | Good | Podcasts, clear audio |
| medium | 769M | ~5GB | Slow | Very good | Noisy audio, accents |
| large-v3 | 1.5G | ~10GB | Very slow | Best | Production quality |
Recommendations:
- CPU-only, speed matters: tiny or base with faster-whisper
- CPU-only, accuracy matters: small with faster-whisper
- GPU available: large-v3 with faster-whisper (float16)
- No local compute: OpenAI API (whisper-1)
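To apply these recommendations automatically, a small sketch; the selection logic is illustrative and not part of faster-whisper:
# Pick a faster-whisper configuration based on the hardware that is available.
from faster_whisper import WhisperModel

def load_recommended_model(prefer_accuracy: bool = True) -> WhisperModel:
    try:
        import torch  # faster-whisper itself does not require torch
        has_gpu = torch.cuda.is_available()
    except ImportError:
        has_gpu = False

    if has_gpu:
        # GPU available: production quality
        return WhisperModel("large-v3", device="cuda", compute_type="float16")
    # CPU only: small favors accuracy, base favors speed
    size = "small" if prefer_accuracy else "base"
    return WhisperModel(size, device="cpu", compute_type="int8")

model = load_recommended_model()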
Examples
Example 1: Transcribe a podcast season and generate SRT subtitles
User prompt: "Transcribe all 20 MP3 episodes in ./episodes/ and generate both plain text transcripts and SRT subtitle files. I have a GPU with 10GB VRAM."
The agent will:
- Install faster-whisper via pip install faster-whisper.
- Load the large-v3 model with device="cuda" and compute_type="float16" to leverage the GPU.
- Create a ./transcripts/ output directory.
- Loop over all .mp3 files in ./episodes/, transcribing each with beam_size=5.
- Write both a .txt file (plain text) and a .srt file (with properly formatted timestamps) for each episode.
- Report the detected language and total processing time per episode.
Example 2: Translate a German interview to English with speaker labels
User prompt: "I have a 45-minute German interview recording at meeting.wav with two speakers. Transcribe it in English and label who said what."
The agent will:
- Install faster-whisper and pyannote-audio (pip install faster-whisper pyannote.audio).
- Transcribe meeting.wav with task="translate" to get English text from the German audio.
- Run pyannote speaker diarization to identify speaker segments (requires a HuggingFace token with pyannote model access).
- Merge Whisper segments with diarization results by matching time ranges to assign speaker labels.
- Output a formatted transcript with [Speaker 1] and [Speaker 2] labels before each segment.
Guidelines
- Use faster-whisper over the original OpenAI Whisper for local transcription; it is 4x faster and uses less memory through INT8 quantization via CTranslate2.
- Select model size based on your hardware: tiny/base for CPU speed, small for CPU accuracy, large-v3 for GPU production quality.
- Always convert audio to WAV or ensure ffmpeg is installed when working with MP3/M4A inputs; Whisper relies on ffmpeg to decode non-WAV formats (see the conversion sketch after this list).
- For speaker diarization, the pyannote pipeline requires a HuggingFace access token with accepted model terms; set this up before attempting multi-speaker transcription.
- When generating SRT files, use beam_size=5 for more accurate segment boundaries; the default greedy decoding can produce poorly timed subtitle breaks.
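For the ffmpeg guideline above, a minimal conversion sketch; it assumes ffmpeg is on PATH, and 16 kHz mono WAV is the input format whisper.cpp expects:
# Convert any audio file to 16 kHz mono WAV before transcription (requires ffmpeg).
import subprocess

def to_wav(src: str, dst: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst],
        check=True,
    )

to_wav("episode.m4a", "episode.wav")  # filenames are illustrative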