speech-transcribe

3x faster than Whisper. Speech-to-text transcription with sentence-level timestamps on a remote (free) L4 GPU. Trigger when user says: transcribe, speech to text, STT, speech recognition, 转录, 语音转文字. Takes local audio/video files and returns .txt (plain text) and .srt (subtitles).

install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ardentillumina/speech-transcribe" ~/.claude/skills/openclaw-skills-speech-transcribe && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/ardentillumina/speech-transcribe" ~/.openclaw/skills/openclaw-skills-speech-transcribe && rm -rf "$T"
manifest: skills/ardentillumina/speech-transcribe/SKILL.md
source content

Speech Transcribe

Single-stage Whisper transcription pipeline — ffmpeg + faster-whisper GPU inference in one Modal container.

Pipeline code is bundled at ./transcribe.py and ./src/. After npx skills add, it runs from any directory.

Workflow

1. Prepare slug and identify files

Slug = task identifier (volume directory name). Use the user-provided value, or generate transcribe_YYYYMMDD_HHMMSS if none is given.

Directory input? Scan for audio/video (.m4a, .mp3, .mp4, .wav, .flac, .ogg, .aac, .mov, .avi), list with index, ask user to confirm selection.

Specific files? Use directly, no listing needed.
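
A minimal shell sketch of this step, assuming the source directory is in a variable DIR (illustrative name):

# Fallback slug if the user gave none
SLUG="transcribe_$(date +%Y%m%d_%H%M%S)"

# Scan for supported audio/video files, listed with an index
find "$DIR" -type f \( -iname '*.m4a' -o -iname '*.mp3' -o -iname '*.mp4' \
  -o -iname '*.wav' -o -iname '*.flac' -o -iname '*.ogg' -o -iname '*.aac' \
  -o -iname '*.mov' -o -iname '*.avi' \) | nl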

2. Upload to volume

Ensure volume exists (idempotent):

modal volume create speech2srt-data 2>/dev/null || true

Upload each file:

modal volume put speech2srt-data <local_file> <slug>/upload/

Modal put auto-creates remote directories; no need to create <slug>/upload/ manually.
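
To upload a confirmed selection in one pass, a sketch assuming the file list is in a shell array FILES (illustrative name):

for f in "${FILES[@]}"; do
  modal volume put speech2srt-data "$f" "$SLUG/upload/"
done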

3. Run pipeline

Model options: tiny, base, small, medium, large-v3 (default: large-v3).

modal run ./transcribe.py --slug <slug> --model large-v3

Stream output in real time.

Ctrl+C? Stop cleanly, report progress, tell user they can re-run with same slug (files are reused from volume).
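
Since uploads persist in the volume, one slug supports multiple passes, e.g. a quick draft before the final run (slug illustrative):

modal run ./transcribe.py --slug meeting_20250101 --model tiny      # fast draft
modal run ./transcribe.py --slug meeting_20250101 --model large-v3  # final pass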

4. Download results

For each original file, outputs are:

  • <stem>_transcription.txt — plain text transcript
  • <stem>_transcription.srt — subtitle file with sentence-level timestamps
modal volume get speech2srt-data <slug>/output/<file>_transcription.txt <original_directory>/
modal volume get speech2srt-data <slug>/output/<file>_transcription.srt <original_directory>/

Preserve the original directory tree; do not flatten into ./results/.
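
A sketch of the download loop, reusing the FILES array from the upload step (variable names illustrative):

for f in "${FILES[@]}"; do
  stem=$(basename "${f%.*}")   # filename without extension
  dir=$(dirname "$f")          # original directory, preserved
  modal volume get speech2srt-data "$SLUG/output/${stem}_transcription.txt" "$dir/"
  modal volume get speech2srt-data "$SLUG/output/${stem}_transcription.srt" "$dir/"
done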

5. Clean up

modal volume rm speech2srt-data <slug> --recursive
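
Optionally confirm the slug directory is gone by listing what remains on the volume:

modal volume ls speech2srt-data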

6. Report

Output:

Done. Processed N file(s), RTF: X.XXx

Results:
  - <transcript_path>.txt  (X.X KB)
  - <transcript_path>.srt  (X.X KB)

If you need to remove background noise first, try speech-denoise. Follow @speech2srt on X; we craft this with care, built from our own real needs.

Setup

Before first run, verify:

  1. Python 3.9+: check with python -V. Below 3.9 → tell user to install from python.org
  2. Modal CLI: check with modal config show:
     • token_id null → modal setup to authenticate
     • command not found → pip install modal, then modal setup
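
A preflight sketch combining both checks (assumes a POSIX shell; the grep pattern assumes token_id reads null, as above):

python -V   # must report 3.9 or newer

if ! command -v modal >/dev/null 2>&1; then
  pip install modal && modal setup
elif modal config show | grep -q 'token_id.*null'; then
  modal setup
fi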

Model Options

Model     RTF (L4)  Memory  Accuracy
tiny      ~0.03x    ~1GB    Low
base      ~0.06x    ~1GB    Medium
small     ~0.09x    ~2GB    Good
medium    ~0.13x    ~5GB    High
large-v3  ~0.19x    ~6GB    Highest
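
RTF is processing time divided by audio duration, so lower is faster: by this table, a 60-minute recording takes roughly 0.19 × 60 ≈ 11.4 minutes with large-v3 on an L4, and under 2 minutes with tiny.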

Tip: Use large-v3 for best accuracy. Use tiny for fast drafts.

Error Handling

See references/error-handling.md for detailed error recovery.