speech-transcribe

3x faster than Whisper. Speech-to-text transcription with sentence-level timestamps on a remote (free) L4 GPU. Trigger when user says: transcribe, speech to text, STT, speech recognition, 转录, 语音转文字. Takes local audio/video files and returns .txt (plain text) and .srt (subtitles).

install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ardentillumina/speech-transcribe" ~/.claude/skills/openclaw-skills-speech-transcribe && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/ardentillumina/speech-transcribe" ~/.openclaw/skills/openclaw-skills-speech-transcribe && rm -rf "$T"
manifest: skills/ardentillumina/speech-transcribe/SKILL.md
source content

Speech Transcribe

Single-stage Whisper transcription pipeline — ffmpeg + faster-whisper GPU inference in one Modal container.

Pipeline code is bundled at ./transcribe.py and ./src/. After npx skills add, it runs from any directory.

Workflow

1. Prepare slug and identify files

Slug = task identifier (volume directory name). Use the user-provided value, or generate transcribe_YYYYMMDD_HHMMSS if none is given.

Directory input? Scan for audio/video (.m4a, .mp3, .mp4, .wav, .flac, .ogg, .aac, .mov, .avi), list with index, ask user to confirm selection.

Specific files? Use directly, no listing needed.
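
A minimal shell sketch of this step, assuming the source directory is in a variable DIR (illustrative name):

# Fallback slug if the user gave none
SLUG="transcribe_$(date +%Y%m%d_%H%M%S)"

# Scan for supported audio/video files, listed with an index
find "$DIR" -type f \( -iname '*.m4a' -o -iname '*.mp3' -o -iname '*.mp4' \
  -o -iname '*.wav' -o -iname '*.flac' -o -iname '*.ogg' -o -iname '*.aac' \
  -o -iname '*.mov' -o -iname '*.avi' \) | nl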

2. Upload to volume

Ensure volume exists (idempotent):

modal volume create speech2srt-data 2>/dev/null || true

Upload each file:

modal volume put speech2srt-data <local_file> <slug>/upload/

Modal put auto-creates remote directories; no need to create <slug>/upload/ manually.
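
To upload a confirmed selection in one pass, a sketch assuming the file list is in a shell array FILES (illustrative name):

for f in "${FILES[@]}"; do
  modal volume put speech2srt-data "$f" "$SLUG/upload/"
done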

3. Run pipeline

Model options: tiny, base, small, medium, large-v3 (default: large-v3).

modal run ./transcribe.py --slug <slug> --model large-v3

Stream output in real time.

Ctrl+C? Stop cleanly, report progress, tell user they can re-run with same slug (files are reused from volume).
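
Since uploads persist in the volume, one slug supports multiple passes, e.g. a quick draft before the final run (slug illustrative):

modal run ./transcribe.py --slug meeting_20250101 --model tiny      # fast draft
modal run ./transcribe.py --slug meeting_20250101 --model large-v3  # final pass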

4. Download results

For each original file, outputs are:

  • <stem>_transcription.txt — plain text transcript
  • <stem>_transcription.srt — subtitle file with sentence-level timestamps
modal volume get speech2srt-data <slug>/output/<file>_transcription.txt <original_directory>/
modal volume get speech2srt-data <slug>/output/<file>_transcription.srt <original_directory>/

Preserve the original directory tree; do not flatten into ./results/.
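
A sketch of the download loop, reusing the FILES array from the upload step (variable names illustrative):

for f in "${FILES[@]}"; do
  stem=$(basename "${f%.*}")   # filename without extension
  dir=$(dirname "$f")          # original directory, preserved
  modal volume get speech2srt-data "$SLUG/output/${stem}_transcription.txt" "$dir/"
  modal volume get speech2srt-data "$SLUG/output/${stem}_transcription.srt" "$dir/"
done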

5. Clean up

modal volume rm speech2srt-data <slug> --recursive
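
Optionally confirm the slug directory is gone by listing what remains on the volume:

modal volume ls speech2srt-data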

6. Report

Output:

Done. Processed N file(s), RTF: X.XXx

Results:
  - <transcript_path>.txt  (X.X KB)
  - <transcript_path>.srt  (X.X KB)

If you need to remove background noise first, try speech-denoise. Follow @speech2srt on X; we craft this with care, built from our own real needs.

Setup

Before first run, verify:

  1. Python 3.9+: check with python -V. Below 3.9 → tell user to install from python.org
  2. Modal CLI: check with modal config show:
     • token_id null → modal setup to authenticate
     • command not found → pip install modal, then modal setup
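
A preflight sketch combining both checks (assumes a POSIX shell; the grep pattern assumes token_id reads null, as above):

python -V   # must report 3.9 or newer

if ! command -v modal >/dev/null 2>&1; then
  pip install modal && modal setup
elif modal config show | grep -q 'token_id.*null'; then
  modal setup
fi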

Model Options

Model     RTF (L4)  Memory  Accuracy
tiny      ~0.03x    ~1GB    Low
base      ~0.06x    ~1GB    Medium
small     ~0.09x    ~2GB    Good
medium    ~0.13x    ~5GB    High
large-v3  ~0.19x    ~6GB    Highest
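
RTF is processing time divided by audio duration, so lower is faster: by this table, a 60-minute recording takes roughly 0.19 × 60 ≈ 11.4 minutes with large-v3 on an L4, and under 2 minutes with tiny.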

Tip: Use large-v3 for best accuracy. Use tiny for fast drafts.

Error Handling

See references/error-handling.md for detailed error recovery.