Indic-voice-pipeline whisper-transcribe

install
source · Clone the upstream repo
git clone https://github.com/humancto/indic-voice-pipeline
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/humancto/indic-voice-pipeline "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/whisper-transcribe" ~/.claude/skills/humancto-indic-voice-pipeline-whisper-transcribe && rm -rf "$T"
manifest: skills/whisper-transcribe/SKILL.md
source content

Whisper Transcribe

Transcribe and translate audio/video files locally using OpenAI Whisper. Supports 99 languages and runs entirely on your machine.

Prerequisites

Run once to install dependencies:

pip install openai-whisper --quiet
pip install transformers accelerate --quiet  # For HuggingFace fine-tuned models

ffmpeg is required for audio processing:

brew install ffmpeg  # macOS
sudo apt install ffmpeg  # Debian/Ubuntu

Step-by-Step Workflow

For ANY transcription/translation request, follow these steps:

Step 1: Check dependencies

/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/check_deps.py
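The actual checks live in check_deps.py; conceptually they reduce to something like this sketch (the function name and return shape here are illustrative, not the script's API):

```python
# Minimal dependency probe: the skill needs the openai-whisper package
# importable and the ffmpeg binary on PATH.
import importlib.util
import shutil

def check_deps() -> dict:
    """Return availability of each dependency as a bool."""
    return {
        "whisper": importlib.util.find_spec("whisper") is not None,
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }

if __name__ == "__main__":
    for name, ok in check_deps().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```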

Step 2: Determine intent and run the appropriate command

User wants to transcribe audio/video to text:

/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE_PATH>" --output-dir ~/Downloads

User wants to translate audio/video to English:

/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py translate "<FILE_PATH>" --output-dir ~/Downloads

User wants to detect the language:

/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py detect "<FILE_PATH>"

User wants file info without transcribing:

/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py info "<FILE_PATH>"

Step 3: Report results

Tell the user:

  • Detected language and confidence
  • The full transcription text
  • Where output files were saved (text, SRT subtitles, JSON)
  • Processing time
  • If translated: both original language and English translation

All Commands

# Transcribe audio/video (auto-detects language, saves .txt + .srt + .json)
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --output-dir ~/Downloads

# Transcribe with a specific source language (faster, skips detection)
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --language te

# Transcribe with a larger model for better accuracy
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --model medium

# Transcribe with a specific HuggingFace fine-tuned model
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --hf-model "vasista22/whisper-telugu-large-v2"

# Translate any language to English
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py translate "<FILE>" --output-dir ~/Downloads

# Translate with known source language
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py translate "<FILE>" --language te

# Detect language of audio
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py detect "<FILE>"

# Show audio file metadata
/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py info "<FILE>"

Indian Language Fine-Tuned Models

The skill supports 12 Indian languages with fine-tuned Whisper models from two sources:

  • vasista22 (IIT Madras Speech Lab) — HuggingFace hosted, plug-and-play
  • AI4Bharat IndicWhisper — Downloaded as ZIP, cached locally at
    ~/.cache/indicwhisper/

Auto-routing: just pass --language <code> and the best model is selected automatically:

/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --language te

Manual override: use --hf-model to specify any HuggingFace Whisper model:

/usr/local/opt/python@3.11/bin/python3.11 ~/.claude/skills/whisper-transcribe/scripts/whisper_transcribe.py transcribe "<FILE>" --hf-model "vasista22/whisper-telugu-large-v2"

vasista22 Models (HuggingFace — auto-downloaded)

Language   Code  Model
Telugu     te    vasista22/whisper-telugu-large-v2
Hindi      hi    vasista22/whisper-hindi-large-v2
Kannada    kn    vasista22/whisper-kannada-medium
Gujarati   gu    vasista22/whisper-gujarati-medium
Tamil      ta    vasista22/whisper-tamil-medium

Models by vasista22 (IIT Madras Speech Lab), funded by Bhashini / MeitY.

AI4Bharat IndicWhisper Models (ZIP download — cached locally)

These models are fine-tuned on Whisper-medium using the Vistaar dataset. First use downloads the model ZIP (~500-800 MB) and caches it at ~/.cache/indicwhisper/<language>/.

Language   Code  Source
Bengali    bn    IndicWhisper (AI4Bharat)
Malayalam  ml    IndicWhisper (AI4Bharat)
Marathi    mr    IndicWhisper (AI4Bharat)
Odia       or    IndicWhisper (AI4Bharat)
Punjabi    pa    IndicWhisper (AI4Bharat)
Sanskrit   sa    IndicWhisper (AI4Bharat)
Urdu       ur    IndicWhisper (AI4Bharat)

Models by AI4Bharat (IIT Madras), MIT licensed.

Priority

When a language has models from both sources (e.g. Hindi, Gujarati, Kannada, Tamil), the vasista22 HuggingFace model is preferred. IndicWhisper is used for languages not covered by vasista22.
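This routing can be sketched from the tables in this document; the real selection logic lives in whisper_transcribe.py, and the return values for the non-HuggingFace branches are illustrative placeholders:

```python
# Language-code -> model routing, per the priority rule above:
# vasista22 HuggingFace models are preferred where they exist,
# IndicWhisper covers the rest, and unknown codes fall back to
# stock Whisper with auto-detection.
VASISTA22 = {
    "te": "vasista22/whisper-telugu-large-v2",
    "hi": "vasista22/whisper-hindi-large-v2",
    "kn": "vasista22/whisper-kannada-medium",
    "gu": "vasista22/whisper-gujarati-medium",
    "ta": "vasista22/whisper-tamil-medium",
}
INDICWHISPER = {"bn", "ml", "mr", "or", "pa", "sa", "ur"}

def pick_model(lang: str) -> str:
    if lang in VASISTA22:        # HuggingFace model, preferred
        return VASISTA22[lang]
    if lang in INDICWHISPER:     # ZIP download, cached locally
        return f"indicwhisper:{lang}"
    return "openai-whisper"      # stock Whisper for everything else
```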

Model Sizes

Model   Size    Speed     Accuracy  Best for
tiny    39 MB   Fastest   Low       Quick drafts, clear speech
base    74 MB   Fast      Good      Default, good balance
small   244 MB  Moderate  Better    Noisy audio, accented speech
medium  769 MB  Slow      Great     Non-English, complex audio
large   1.5 GB  Slowest   Best      Maximum accuracy, rare languages

Supported Languages (selection)

English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese, Korean, Arabic, Hindi, Telugu, Tamil, Bengali, Turkish, Ukrainian, Vietnamese, Thai, Indonesian, Swedish, and 70+ more.

Important Notes

  • Default output location is ~/Downloads
  • All output is JSON to stdout; status messages go to stderr
  • Three output files per transcription: .txt (plain text), .srt (subtitles), .json (structured)
  • Works with both audio files (mp3, wav, m4a, ogg, flac) and video files (mp4, mkv, webm, mov)
  • Video files have audio automatically extracted before transcription
  • Translation always outputs English (this is a Whisper limitation)
  • First run downloads the model (~74 MB for base) — subsequent runs use cache
  • Runs 100% locally — no internet needed after model download, no API keys
  • Use --model medium or --model large for better accuracy on non-English or noisy audio