Learn-skills.dev audio-transcribe
Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.
install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agntswrm/agent-media/audio-transcribe" ~/.claude/skills/neversight-learn-skills-dev-audio-transcribe && rm -rf "$T"
manifest:
data/skills-md/agntswrm/agent-media/audio-transcribe/SKILL.md
Audio Transcribe
Transcribes audio files to text with timestamps. Supports automatic language detection, speaker identification (diarization), and outputs structured JSON with segment-level timing.
Command
agent-media audio transcribe --in <path> [options]
Inputs
| Option | Required | Description |
|---|---|---|
| `--in` | Yes | Input audio file path or URL (supports mp3, wav, m4a, ogg) |
| `--diarize` | No | Enable speaker identification |
| `--language` | No | Language code (auto-detected if not provided) |
| `--speakers` | No | Number of speakers hint for diarization |
|  | No | Output path, filename or directory (default: ./) |
| `--provider` | No | Provider to use (local, fal, replicate) |
Output
Returns a JSON object with transcription data:
{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
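The result file can be post-processed directly. A minimal Python sketch, assuming the JSON shape shown above (`segment_lines` is an illustrative helper, not part of the tool):

```python
import json

def segment_lines(result):
    """Turn an audio-transcribe result dict into timestamped text lines."""
    if not result.get("ok"):
        raise RuntimeError("transcription failed")
    lines = []
    for seg in result["transcription"]["segments"]:
        speaker = seg.get("speaker", "")  # only present with --diarize
        prefix = f"[{seg['start']:.1f}-{seg['end']:.1f}]"
        lines.append(" ".join(p for p in (prefix, speaker, seg["text"]) if p))
    return lines

# Usage: open the file named by "output_path", then:
#   with open("transcription_123_abc.json") as f:
#       print("\n".join(segment_lines(json.load(f))))
```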
Examples
Basic transcription (auto-detect language):
agent-media audio transcribe --in interview.mp3
Transcription with speaker identification:
agent-media audio transcribe --in meeting.wav --diarize
Transcription with specific language and speaker count:
agent-media audio transcribe --in podcast.mp3 --diarize --language en --speakers 3
Use specific provider:
agent-media audio transcribe --in audio.wav --provider replicate
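Since subtitle creation is a stated use case, the `segments` array maps naturally onto the SRT format. A Python sketch, assuming segments shaped as in the Output section (helper names are illustrative):

```python
def to_srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render segment dicts (start, end, text) as an SRT subtitle string."""
    blocks = []
    for i, seg in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{to_srt_timestamp(seg['start'])} --> "
            f"{to_srt_timestamp(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(blocks) + "\n"
```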
Extracting Audio from Video
To transcribe a video file, first extract the audio:
# Step 1: Extract audio from video
agent-media audio extract --in video.mp4 --format mp3
# Step 2: Transcribe the extracted audio
agent-media audio transcribe --in extracted_xxx.mp3
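From a script, the two steps can be chained. A Python sketch under the assumption that the CLI is on `PATH`; `transcribe_video` is an illustrative wrapper, and the audio path must match the file the extract step actually wrote:

```python
import subprocess

def transcribe_video(video_path, audio_path, run=subprocess.run):
    """Extract audio from a video, then transcribe it (two agent-media calls).

    audio_path must be the file the extract step wrote; run is injectable
    so the command construction can be exercised without the CLI installed.
    """
    run(["agent-media", "audio", "extract", "--in", video_path,
         "--format", "mp3"], check=True)
    run(["agent-media", "audio", "transcribe", "--in", audio_path], check=True)
```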
Providers
local
Runs locally on CPU using Transformers.js, no API key required.
- Uses the Moonshine model (5x faster than Whisper)
- Models downloaded on first use (~100MB)
- Does NOT support diarization — use fal or replicate for speaker identification
- You may see a `mutex lock failed` error; ignore it, the output is correct if `"ok": true`
agent-media audio transcribe --in audio.mp3 --provider local
fal
- Requires `FAL_API_KEY`
- Uses the `wizper` model for fast transcription (2x faster) when diarization is disabled
- Uses the `whisper` model when diarization is enabled (native support)
replicate
- Requires `REPLICATE_API_TOKEN`
- Uses the `whisper-diarization` model with Whisper Large V3 Turbo
- Native diarization support with word-level timestamps
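The key requirements above suggest a simple provider-selection rule: local needs no key but cannot diarize, while fal and replicate each require their API key. A hypothetical Python helper (not part of the CLI) sketching that logic:

```python
import os

def pick_provider(need_diarization, env=os.environ):
    """Choose a --provider value from available API keys.

    local requires no key but does not support diarization;
    fal needs FAL_API_KEY and replicate needs REPLICATE_API_TOKEN.
    """
    if "FAL_API_KEY" in env:
        return "fal"
    if "REPLICATE_API_TOKEN" in env:
        return "replicate"
    if need_diarization:
        raise RuntimeError("diarization needs fal or replicate (set an API key)")
    return "local"
```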