Skillsbench whisper-transcription
Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word.
install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/video-filler-word-remover/environment/skills/whisper-transcription" ~/.claude/skills/benchflow-ai-skillsbench-whisper-transcription && rm -rf "$T"
manifest:
tasks/video-filler-word-remover/environment/skills/whisper-transcription/SKILL.mdsource content
Whisper Transcription
OpenAI Whisper provides accurate speech-to-text with word-level timestamps.
Installation
pip install openai-whisper
Model Selection
Use the
model for fast transcription - it's sufficient for most tasks and runs much faster:tiny
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | 39 MB | Fastest | Good for clear speech |
| base | 74 MB | Fast | Better accuracy |
| small | 244 MB | Medium | High accuracy |
Recommendation: Start with
- it handles clear interview/podcast audio well.tiny
Basic Usage with Word Timestamps
import whisper import json def transcribe_with_timestamps(audio_path, output_path): """ Transcribe audio and get word-level timestamps. Args: audio_path: Path to audio/video file output_path: Path to save JSON output """ # Use tiny model for speed model = whisper.load_model("tiny") # Transcribe with word timestamps result = model.transcribe( audio_path, word_timestamps=True, language="en" # Specify language for better accuracy ) # Extract words with timestamps words = [] for segment in result["segments"]: if "words" in segment: for word_info in segment["words"]: words.append({ "word": word_info["word"].strip(), "start": word_info["start"], "end": word_info["end"] }) with open(output_path, "w") as f: json.dump(words, f, indent=2) return words
Detecting Specific Words
def find_words(transcription, target_words): """ Find specific words in transcription with their timestamps. Args: transcription: List of word dicts with 'word', 'start', 'end' target_words: Set of words to find (lowercase) Returns: List of matches with word and timestamp """ matches = [] target_lower = {w.lower() for w in target_words} for item in transcription: word = item["word"].lower().strip() # Remove punctuation for matching clean_word = ''.join(c for c in word if c.isalnum()) if clean_word in target_lower: matches.append({ "word": clean_word, "timestamp": item["start"] }) return matches
Complete Example: Find Filler Words
import whisper import json # Filler words to detect FILLER_WORDS = { "um", "uh", "hum", "hmm", "mhm", "like", "so", "well", "yeah", "okay", "basically", "actually", "literally" } def detect_fillers(audio_path, output_path): # Load tiny model (fast!) model = whisper.load_model("tiny") # Transcribe result = model.transcribe(audio_path, word_timestamps=True, language="en") # Find fillers fillers = [] for segment in result["segments"]: for word_info in segment.get("words", []): word = word_info["word"].lower().strip() clean = ''.join(c for c in word if c.isalnum()) if clean in FILLER_WORDS: fillers.append({ "word": clean, "timestamp": round(word_info["start"], 2) }) with open(output_path, "w") as f: json.dump(fillers, f, indent=2) return fillers # Usage detect_fillers("/root/input.mp4", "/root/annotations.json")
Audio Extraction (if needed)
Whisper can process video files directly, but for cleaner results:
# Extract audio as 16kHz mono WAV ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
Multi-Word Phrases
For detecting phrases like "you know" or "I mean":
def find_phrases(transcription, phrases): """Find multi-word phrases in transcription.""" matches = [] words = [w["word"].lower().strip() for w in transcription] for phrase in phrases: phrase_words = phrase.lower().split() phrase_len = len(phrase_words) for i in range(len(words) - phrase_len + 1): if words[i:i+phrase_len] == phrase_words: matches.append({ "word": phrase, "timestamp": transcription[i]["start"] }) return matches