Learn-skills.dev audio-transcribe

Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.

Install

Source: clone the upstream repo

git clone https://github.com/NeverSight/learn-skills.dev

Claude Code: install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agntswrm/agent-media/audio-transcribe" ~/.claude/skills/neversight-learn-skills-dev-audio-transcribe && rm -rf "$T"

Manifest: data/skills-md/agntswrm/agent-media/audio-transcribe/SKILL.md

Audio Transcribe

Transcribes audio files to text with timestamps. Supports automatic language detection, speaker identification (diarization), and outputs structured JSON with segment-level timing.

Command

agent-media audio transcribe --in <path> [options]

Inputs

Option       Required  Description
--in         Yes       Input audio file path or URL (supports mp3, wav, m4a, ogg)
--diarize    No        Enable speaker identification
--language   No        Language code (auto-detected if not provided)
--speakers   No        Number of speakers hint for diarization
--out        No        Output path (filename or directory; default: ./)
--provider   No        Provider to use (local, fal, replicate)

Output

Returns a JSON object with transcription data:

{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
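For downstream processing (subtitles, per-speaker logs), the segments array can be flattened with jq. A minimal sketch, assuming jq is installed, using a mocked result that matches the schema above:

```shell
# Write a sample result matching the schema above (mocked for illustration)
cat > transcription.json <<'EOF'
{
  "ok": true,
  "transcription": {
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
EOF

# Only process the result if the call succeeded ("ok": true)
if [ "$(jq -r '.ok' transcription.json)" = "true" ]; then
  # One line per segment: [start-end] SPEAKER: text
  jq -r '.transcription.segments[] | "[\(.start)-\(.end)] \(.speaker): \(.text)"' transcription.json
fi
```

The same segment loop is the starting point for SRT/VTT generation; only the timestamp formatting differs.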

Examples

Basic transcription (auto-detect language):

agent-media audio transcribe --in interview.mp3

Transcription with speaker identification:

agent-media audio transcribe --in meeting.wav --diarize

Transcription with specific language and speaker count:

agent-media audio transcribe --in podcast.mp3 --diarize --language en --speakers 3

Use specific provider:

agent-media audio transcribe --in audio.wav --provider replicate

Extracting Audio from Video

To transcribe a video file, first extract the audio:

# Step 1: Extract audio from video
agent-media audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio
agent-media audio transcribe --in extracted_xxx.mp3
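If the extract step also prints its result as JSON with an output_path field (an assumption based on the transcribe output above; verify against your version), the two steps can be chained without retyping the filename. A sketch with the CLI call mocked out:

```shell
# Mocked stand-in for: agent-media audio extract --in video.mp4 --format mp3
# (assumes the extract action also reports its file via "output_path")
result='{"ok": true, "output_path": "extracted_abc.mp3"}'

# Capture the extracted audio path with jq
AUDIO=$(printf '%s' "$result" | jq -r '.output_path')
echo "$AUDIO"

# Then feed it to the transcriber:
# agent-media audio transcribe --in "$AUDIO"
```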

Providers

local

Runs locally on CPU using Transformers.js, no API key required.

  • Uses the Moonshine model (5x faster than Whisper)
  • Models are downloaded on first use (~100MB)
  • Does NOT support diarization; use fal or replicate for speaker identification
  • You may see a "mutex lock failed" error; ignore it, the output is correct if the result has "ok": true

agent-media audio transcribe --in audio.mp3 --provider local

fal

  • Requires FAL_API_KEY
  • Uses the wizper model for fast transcription (2x faster) when diarization is disabled
  • Uses the whisper model when diarization is enabled (native support)

replicate

  • Requires REPLICATE_API_TOKEN
  • Uses the whisper-diarization model with Whisper Large V3 Turbo
  • Native diarization support with word-level timestamps
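Since fal and replicate each require their own credential, a wrapper script can fall back to local when neither key is set. A minimal sketch; the fallback order here is a choice for illustration, not behavior of the CLI itself:

```shell
# Pick a provider based on which API keys are present in the environment
if [ -n "${FAL_API_KEY:-}" ]; then
  PROVIDER=fal          # wizper/whisper via fal
elif [ -n "${REPLICATE_API_TOKEN:-}" ]; then
  PROVIDER=replicate    # whisper-diarization via Replicate
else
  PROVIDER=local        # Moonshine on CPU; no diarization
fi

echo "Using provider: $PROVIDER"
# agent-media audio transcribe --in audio.mp3 --provider "$PROVIDER"
```

Remember that the local fallback cannot honor --diarize, so a stricter script might error out instead when diarization is requested and no API key is available.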