# indic-voice-pipeline: whisper-transcribe

Clone the full repository:

```bash
git clone https://github.com/humancto/indic-voice-pipeline
```

Or install just this skill:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/humancto/indic-voice-pipeline "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/whisper-transcribe" ~/.claude/skills/humancto-indic-voice-pipeline-whisper-transcribe && rm -rf "$T"
```

`skills/whisper-transcribe/SKILL.md`:

# Whisper Transcribe
Transcribe and translate audio/video files locally using OpenAI Whisper. Supports 99 languages, runs entirely on your machine.
## Prerequisites

Run once to install dependencies:

```bash
pip install openai-whisper --quiet
pip install transformers accelerate --quiet  # For HuggingFace fine-tuned models
```

ffmpeg is required for audio processing:

```bash
brew install ffmpeg  # macOS
```
## Step-by-Step Workflow

For ANY transcription/translation request, follow these steps:

### Step 1: Check dependencies

```bash
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/check_deps.py
```
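The dependency check can be sketched as follows. This is a hypothetical outline of what a script like `check_deps.py` might verify; the actual script's contents are not shown in this document:

```python
import importlib.util
import shutil

def check_dependencies():
    """Report whether each required dependency is importable/installed."""
    return {
        "openai-whisper": importlib.util.find_spec("whisper") is not None,
        "transformers": importlib.util.find_spec("transformers") is not None,
        "accelerate": importlib.util.find_spec("accelerate") is not None,
        "ffmpeg": shutil.which("ffmpeg") is not None,  # needed for audio decoding
    }

if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```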
### Step 2: Determine intent and run the appropriate command

User wants to transcribe audio/video to text:

```bash
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE_PATH>" --output-dir ~/Downloads
```

User wants to translate audio/video to English:

```bash
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py translate "<FILE_PATH>" --output-dir ~/Downloads
```

User wants to detect the language:

```bash
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py detect "<FILE_PATH>"
```

User wants file info without transcribing:

```bash
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py info "<FILE_PATH>"
```
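The four intents above differ only in the subcommand, so building the command line can be sketched as a small helper. The interpreter path, script path, and flags are taken from the commands above; the helper itself is illustrative, not part of the skill:

```python
from pathlib import Path

PYTHON = "/usr/local/opt/python@3.11/bin/python3.11"
SCRIPT = str(Path.home() / ".claude/skills/whisper-transcribe/scripts/whisper_transcribe.py")

def build_command(intent, file_path, output_dir=None):
    """Map an intent ('transcribe', 'translate', 'detect', 'info') to an argv list."""
    if intent not in {"transcribe", "translate", "detect", "info"}:
        raise ValueError(f"unknown intent: {intent}")
    cmd = [PYTHON, SCRIPT, intent, file_path]
    # Only transcribe/translate write output files, so only they take --output-dir
    if output_dir and intent in {"transcribe", "translate"}:
        cmd += ["--output-dir", output_dir]
    return cmd
```

For example, `build_command("transcribe", "talk.mp3", "~/Downloads")` reproduces the first command above as an argv list suitable for `subprocess.run`.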
### Step 3: Report results

Tell the user:

- Detected language and confidence
- The full transcription text
- Where output files were saved (text, SRT subtitles, JSON)
- Processing time
- If translated: both the original language and the English translation
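Since the script emits JSON on stdout (see Important Notes below), the report can be assembled from the parsed result. The field names used here (`language`, `confidence`, `text`, `files`, `duration`) are assumptions for illustration, not the script's documented schema:

```python
def format_report(result: dict) -> str:
    """Render a user-facing summary from a parsed result dict.

    Field names are hypothetical; adapt them to the script's actual JSON.
    """
    lines = [
        f"Language: {result.get('language', 'unknown')} "
        f"(confidence: {result.get('confidence', 'n/a')})",
        f"Transcription: {result.get('text', '')}",
        "Output files: " + ", ".join(result.get("files", [])),
        f"Processing time: {result.get('duration', 'n/a')}s",
    ]
    return "\n".join(lines)
```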
## All Commands

```bash
# Transcribe audio/video (auto-detects language, saves .txt + .srt + .json)
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --output-dir ~/Downloads

# Transcribe with a specific source language (faster, skips detection)
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --language te

# Transcribe with a larger model for better accuracy
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --model medium

# Transcribe with a specific HuggingFace fine-tuned model
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --hf-model "vasista22/whisper-telugu-large-v2"

# Translate any language to English
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py translate "<FILE>" --output-dir ~/Downloads

# Translate with known source language
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py translate "<FILE>" --language te

# Detect language of audio
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py detect "<FILE>"

# Show audio file metadata
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py info "<FILE>"
```
## Indian Language Fine-Tuned Models

The skill supports 12 Indian languages with fine-tuned Whisper models from two sources:

- vasista22 (IIT Madras Speech Lab): HuggingFace-hosted, plug-and-play
- AI4Bharat IndicWhisper: downloaded as a ZIP and cached locally at `~/.cache/indicwhisper/`

Auto-routing: just pass `--language <code>` and the best model is selected automatically:

```bash
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --language te
```

Manual override: use `--hf-model` to specify any HuggingFace Whisper model:

```bash
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --hf-model "vasista22/whisper-telugu-large-v2"
```
### vasista22 Models (HuggingFace, auto-downloaded)

| Language | Code | Model |
|---|---|---|
| Telugu | `te` | `vasista22/whisper-telugu-large-v2` |
| Hindi | `hi` | |
| Kannada | `kn` | |
| Gujarati | `gu` | |
| Tamil | `ta` | |

Models by vasista22 (IIT Madras Speech Lab), funded by Bhashini / MeitY.
### AI4Bharat IndicWhisper Models (ZIP download, cached locally)

These models are fine-tuned from Whisper-medium on the Vistaar dataset. First use downloads the model ZIP (~500-800 MB) and caches it at `~/.cache/indicwhisper/<language>/`.
| Language | Code | Source |
|---|---|---|
| Bengali | `bn` | IndicWhisper (AI4Bharat) |
| Malayalam | `ml` | IndicWhisper (AI4Bharat) |
| Marathi | `mr` | IndicWhisper (AI4Bharat) |
| Odia | `or` | IndicWhisper (AI4Bharat) |
| Punjabi | `pa` | IndicWhisper (AI4Bharat) |
| Sanskrit | `sa` | IndicWhisper (AI4Bharat) |
| Urdu | `ur` | IndicWhisper (AI4Bharat) |
Models by AI4Bharat (IIT Madras), MIT licensed.
## Priority
When a language has models from both sources (e.g. Hindi, Gujarati, Kannada, Tamil), the vasista22 HuggingFace model is preferred. IndicWhisper is used for languages not covered by vasista22.
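The priority rule can be sketched as a small lookup. Only the Telugu repo id appears in this document, so the other vasista22 entries are omitted rather than guessed; the fallback model name is also an assumption:

```python
# Languages with a vasista22 HuggingFace model (preferred when present).
# Only the Telugu repo id is documented here; the others are omitted deliberately.
VASISTA22 = {"te": "vasista22/whisper-telugu-large-v2"}
# Languages covered only by AI4Bharat IndicWhisper (ZIP-cached).
INDICWHISPER = {"bn", "ml", "mr", "or", "pa", "sa", "ur"}

def route(language: str):
    """Pick a model source for a language code: vasista22 > IndicWhisper > stock Whisper."""
    if language in VASISTA22:
        return ("huggingface", VASISTA22[language])
    if language in INDICWHISPER:
        return ("indicwhisper", language)
    return ("whisper", "base")  # fall back to a stock multilingual model
```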
## Model Sizes

| Model | Size | Speed | Accuracy | Best for |
|---|---|---|---|---|
| `tiny` | 39 MB | Fastest | Low | Quick drafts, clear speech |
| `base` | 74 MB | Fast | Good | Default, good balance |
| `small` | 244 MB | Moderate | Better | Noisy audio, accented speech |
| `medium` | 769 MB | Slow | Great | Non-English, complex audio |
| `large` | 1.5 GB | Slowest | Best | Maximum accuracy, rare languages |
## Supported Languages (selection)
English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Hindi, Telugu, Tamil, Bengali, Turkish, Ukrainian, Vietnamese, Thai, Indonesian, Swedish, and 70+ more.
## Important Notes

- Default output location is `~/Downloads`
- All output is JSON on stdout; status messages go to stderr
- Three output files per transcription: `.txt` (plain text), `.srt` (subtitles), `.json` (structured)
- Works with both audio files (mp3, wav, m4a, ogg, flac) and video files (mp4, mkv, webm, mov)
- Video files have their audio automatically extracted before transcription
- Translation always outputs English (a Whisper limitation)
- First run downloads the model (~74 MB for base); subsequent runs use the cache
- Runs 100% locally: no internet needed after model download, no API keys
- Use `--model medium` or `--model large` for better accuracy on non-English or noisy audio
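The automatic audio extraction for video inputs is typically done with ffmpeg. A hypothetical sketch of the command being built; the skill's actual extraction parameters are not documented here, but Whisper expects 16 kHz mono audio:

```python
def extract_audio_cmd(video_path: str, wav_path: str):
    """Build an ffmpeg argv that strips video and resamples to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-vn",                # drop the video stream
        "-ar", "16000",       # 16 kHz sample rate (Whisper's native rate)
        "-ac", "1",           # mono
        "-c:a", "pcm_s16le",  # 16-bit PCM WAV
        wav_path,
    ]
```

Run it with, for example, `subprocess.run(extract_audio_cmd("talk.mp4", "talk.wav"), check=True)`.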