Claude-code-skills asr-transcribe-to-text

Transcribes audio and video files to text using Qwen3-ASR. Supports two modes — local MLX inference on macOS Apple Silicon (no API key, 15-27x realtime) and remote API via vLLM/OpenAI-compatible endpoints. Auto-detects platform and recommends the best path. Triggers when the user wants to transcribe recordings, convert audio/video to text, do speech-to-text, or mentions ASR, Qwen ASR, 转录, 语音转文字, 录音转文字. Also triggers for meeting recordings, lectures, interviews, podcasts, screen recordings, or any audio/video file the user wants converted to text.

install

source · Clone the upstream repo

git clone https://github.com/daymade/claude-code-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/daymade/claude-code-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/asr-transcribe-to-text" ~/.claude/skills/daymade-claude-code-skills-asr-transcribe-to-text && rm -rf "$T"

manifest: asr-transcribe-to-text/SKILL.md

source content

ASR Transcribe to Text

Transcribe audio/video files to text using Qwen3-ASR. Two inference paths:

Mode	When	Speed	Cost
Local MLX	macOS Apple Silicon	15-27x realtime	Free
Remote API	Any platform, or when local unavailable	Depends on GPU	API/self-hosted

Configuration persists in

${CLAUDE_PLUGIN_DATA}/config.json

Step 0: Detect Platform and Load Config

cat "${CLAUDE_PLUGIN_DATA}/config.json" 2>/dev/null

If config exists, read values and proceed to Step 1.

If config does not exist, auto-detect platform first:

python3 -c "
import sys, platform
is_mac_arm = sys.platform == 'darwin' and platform.machine() in ('arm64', 'aarch64')
print(f'Platform: {sys.platform} {platform.machine()}')
print(f'Apple Silicon: {is_mac_arm}')
if is_mac_arm:
    print('RECOMMEND: local-mlx')
else:
    print('RECOMMEND: remote-api')
"

Then use AskUserQuestion with platform-aware defaults:

For macOS Apple Silicon (recommended: local):

ASR setup — your Mac has Apple Silicon, so local transcription is recommended.

Q1: Transcription mode?
  A) Local MLX — runs on your Mac's GPU, no API key needed, 15-27x realtime (Recommended)
  B) Remote API — send audio to a server (vLLM, Tailscale workstation, etc.)

Q2: Does your network have an HTTP proxy that might intercept traffic?
  A) Yes — bypass proxy for ASR traffic (Recommended if using Shadowrocket/Clash)
  B) No — direct connection

For other platforms (recommended: remote):

ASR setup — local MLX requires macOS Apple Silicon. Using remote API mode.

Q1: ASR Endpoint URL?
  A) http://workstation-4090-wsl:8002/v1/audio/transcriptions (Qwen3-ASR vLLM via Tailscale)
  B) http://localhost:8002/v1/audio/transcriptions (Local server)
  C) Custom URL

Q2: Proxy bypass needed?
  A) Yes (Recommended for Shadowrocket/Clash/corporate proxy)
  B) No

Save config:

mkdir -p "${CLAUDE_PLUGIN_DATA}"
python3 -c "
import json
config = {
    'mode': 'MODE',           # 'local-mlx' or 'remote-api'
    'model': 'MODEL_ID',      # local: 'mlx-community/Qwen3-ASR-1.7B-8bit', remote: 'Qwen/Qwen3-ASR-1.7B'
    'max_tokens': 200000,     # local only, critical for long audio
    'endpoint': 'URL',        # remote only
    'noproxy': True,
    'max_timeout': 900        # remote only
}
with open('${CLAUDE_PLUGIN_DATA}/config.json', 'w') as f:
    json.dump(config, f, indent=2)
print('Config saved.')
"

Step 1: Extract Audio (if input is video)

For video files (mp4, mov, mkv, avi, webm), extract as 16kHz mono WAV:

ffmpeg -i INPUT_VIDEO -vn -acodec pcm_s16le -ar 16000 -ac 1 OUTPUT.wav -y

Audio files (wav, mp3, m4a, flac, ogg) can be used directly. Get duration:

ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 INPUT_FILE

Cleanup: After transcription succeeds, delete extracted WAV files to save disk space.

Step 2: Transcribe

Path A: Local MLX (macOS Apple Silicon)

Use the bundled script — it handles model loading, chunking, and the critical

max_tokens

parameter:

uv run ${CLAUDE_PLUGIN_ROOT}/scripts/transcribe_local_mlx.py \
  INPUT_AUDIO [INPUT_AUDIO2 ...] \
  --output-dir OUTPUT_DIR

The script loads the model once and transcribes all files sequentially (no GPU contention). For details on performance, model compatibility, and the max_tokens truncation issue, see

references/local_mlx_guide.md

Critical: The upstream

mlx-audio

default

max_tokens=8192

silently truncates audio longer than ~40 minutes. The bundled script defaults to

. If calling

model.generate()

directly, always pass

max_tokens=200000

Path B: Remote API

Health check first (skip if already verified this session):

python3 -c "
import json, subprocess, sys
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
    cfg = json.load(f)
base = cfg['endpoint'].rsplit('/audio/', 1)[0]
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
result = subprocess.run(
    ['curl', '-s', '--max-time', '10'] + noproxy + [f'{base}/models'],
    capture_output=True, text=True
)
if result.returncode != 0 or not result.stdout.strip():
    print(f'HEALTH CHECK FAILED: {base}/models', file=sys.stderr)
    sys.exit(1)
print(f'Service healthy: {base}')
"

Read config and send via curl:

python3 -c "
import json, subprocess, sys, os, tempfile
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
    cfg = json.load(f)
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
timeout = str(cfg.get('max_timeout', 900))
audio_file = 'AUDIO_FILE_PATH'
output_json = tempfile.mktemp(suffix='.json', prefix='asr_')

result = subprocess.run(
    ['curl', '-s', '--max-time', timeout] + noproxy + [
        cfg['endpoint'],
        '-F', f'file=@{audio_file}',
        '-F', f'model={cfg[\"model\"]}',
        '-o', output_json
    ], capture_output=True, text=True
)

with open(output_json) as f:
    data = json.load(f)
if 'text' not in data:
    print(f'ERROR: {json.dumps(data)[:300]}', file=sys.stderr)
    sys.exit(1)
text = data['text']
print(f'Transcribed: {len(text)} chars', file=sys.stderr)
print(text)
os.unlink(output_json)
" > OUTPUT.txt

If remote health check fails, diagnose in order:

Network:

ping -c 1 HOST

tailscale status | grep HOST

Service:

tailscale ssh USER@HOST "curl -s localhost:PORT/v1/models"

Proxy: retry with
```
--noproxy '*'
```
toggled

Step 3: Verify Output

After transcription, check for truncation — the most common failure mode:

Confirm output is not empty
Check character count is plausible (~400 chars/min for Chinese, ~200 words/min for English)
Check the ending — does it trail off mid-sentence? If so,
```
max_tokens
```
was exhausted
Show user the first and last ~200 characters as preview

If truncated or wrong, use AskUserQuestion:

Transcription may be truncated:
- Expected: ~[N] chars for [M] minutes of audio
- Got: [actual] chars ([pct]% of expected)
- Last line: "[last 100 chars...]"

Options:
A) Retry with higher max_tokens (current: [N], try: [N*2])
B) Switch mode — try [local/remote] instead
C) Save as-is — the output looks complete to me
D) Abort

Step 4: Fallback — Overlap-Merge (Remote API Only)

If single remote request fails (timeout, OOM), fall back to chunked transcription:

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/overlap_merge_transcribe.py \
  --config "${CLAUDE_PLUGIN_DATA}/config.json" \
  INPUT_AUDIO OUTPUT.txt

Splits into 18-minute chunks with 2-minute overlap, merges using punctuation-stripped fuzzy matching. See

references/overlap_merge_strategy.md

for algorithm details.

For local MLX mode, overlap-merge is unnecessary — the bundled script handles chunking internally with

max_tokens=200000

Step 5: Recommend Transcript Correction

ASR output always contains recognition errors — homophones, garbled technical terms, broken sentences. After successful transcription, proactively suggest running the

transcript-fixer

skill on the output:

Transcription complete: [N] chars saved to [output_path].

ASR output typically contains recognition errors (homophones, garbled terms, broken sentences).
Would you like me to run /transcript-fixer to clean up the text?

Options:
A) Yes — run transcript-fixer on the output now (Recommended)
B) No — the raw transcription is good enough for my needs
C) Later — I'll run it myself when ready

If the user chooses A, invoke the

transcript-fixer

skill with the output file path. The two skills form a natural pipeline: transcribe → correct → review.

Reconfigure

rm "${CLAUDE_PLUGIN_DATA}/config.json"

Then re-run Step 0.

Bundled Resources

Scripts:

```
transcribe_local_mlx.py
```
— Local MLX transcription (macOS ARM64, PEP 723 deps)
```
overlap_merge_transcribe.py
```
— Chunked transcription with overlap merge (remote API fallback)

References:

```
local_mlx_guide.md
```
— Performance benchmarks, max_tokens truncation, model compatibility
```
overlap_merge_strategy.md
```
— Why naive chunking fails, fuzzy merge algorithm