# audio-to-text-and-video-to-text

Clone the skills repository:

```bash
git clone https://github.com/openclaw/skills
```

Install into Claude's skills directory:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ahqazi-dev/audio-to-text-and-video-to-text" ~/.claude/skills/openclaw-skills-audio-to-text-and-video-to-text-581d46 && rm -rf "$T"
```

Or into OpenClaw's skills directory:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/ahqazi-dev/audio-to-text-and-video-to-text" ~/.openclaw/skills/openclaw-skills-audio-to-text-and-video-to-text-581d46 && rm -rf "$T"
```
`skills/ahqazi-dev/audio-to-text-and-video-to-text/SKILL.md`

# Transcription Skill
Converts audio and video files into clean, readable text using OpenAI's Whisper API and ffmpeg for media handling.
## Overview
This skill handles the full pipeline:
- Media extraction — use ffmpeg to strip audio from video files and convert to a Whisper-compatible format
- Chunking — split large files (>25 MB) into overlapping segments to stay within API limits
- Transcription — send each chunk to OpenAI's Whisper API
- Assembly — merge chunk transcripts, adjusting timestamps, into a single clean output
- Post-processing — optionally clean up with Claude (punctuation, speaker labels, summaries)
## Requirements
- ffmpeg must be installed (`which ffmpeg` to verify; it's usually pre-installed in claude.ai's environment)
- OpenAI API key stored in the environment as `OPENAI_API_KEY` (the user must provide this)
- Python packages: `openai`, `pydub` (install via pip if needed)
## Quick Start
When a user provides a media file, run the transcription script:
```bash
# Install dependencies if missing
pip install openai pydub --break-system-packages -q

# Run transcription
python /home/claude/transcription/scripts/transcribe.py \
  --input "/path/to/media/file" \
  --output "/mnt/user-data/outputs/transcript.txt" \
  --api-key "$OPENAI_API_KEY"
```
See `scripts/transcribe.py` for the full implementation.
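For orientation, here is a minimal sketch of the core call such a script makes, assuming the `openai` Python client; the real `scripts/transcribe.py` also adds chunking, output formats, and error handling:

```python
# Illustrative sketch only; not the actual scripts/transcribe.py.
import argparse
import os
from openai import OpenAI

def transcribe(input_path: str, output_path: str, api_key: str) -> None:
    client = OpenAI(api_key=api_key)
    with open(input_path, "rb") as media:
        # whisper-1 is OpenAI's hosted Whisper model
        result = client.audio.transcriptions.create(model="whisper-1", file=media)
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as out:
        out.write(result.text)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--api-key", default=os.environ.get("OPENAI_API_KEY"))
    args = parser.parse_args()
    transcribe(args.input, args.output, args.api_key)
```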
## Supported Formats
| Category | Formats |
|---|---|
| Audio | mp3, wav, m4a, ogg, flac, aac, opus, wma |
| Video | mp4, mov, avi, mkv, webm, wmv, m4v |
ffmpeg handles extraction from any of these.
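For example, the audio-stripping step can shell out to ffmpeg from Python; a minimal sketch, where the mono / 16 kHz / 64 kbps settings are illustrative choices rather than requirements of this skill:

```python
import subprocess

def extract_audio(video_path: str, audio_path: str = "audio.mp3") -> str:
    """Strip the audio track from a video and re-encode it in a Whisper-friendly format."""
    subprocess.run(
        [
            "ffmpeg", "-y",      # overwrite output if it already exists
            "-i", video_path,    # input media file
            "-vn",               # drop the video stream
            "-ac", "1",          # mono
            "-ar", "16000",      # 16 kHz sample rate
            "-b:a", "64k",       # low bitrate keeps the file small for the API limit
            audio_path,
        ],
        check=True,
        capture_output=True,
    )
    return audio_path
```

Re-encoding at a low bitrate also keeps the extracted audio under the 25 MB API limit in most cases.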
## Options & Flags
| Flag | Default | Description |
|---|---|---|
|  |  | Whisper model to use |
|  | auto-detect | ISO 639-1 language code |
|  |  | Output format: `txt`, `srt`, `vtt`, `json` |
|  | off | Include timestamps in output |
|  |  | Max chunk size in MB (must be ≤ 25) |
| `--prompt` | none | Context hint to improve accuracy (e.g. domain vocab) |
## Output Formats
- txt — plain text, ideal for most uses
- srt — SubRip subtitle format (for video players)
- vtt — WebVTT format (for web video)
- json — full Whisper JSON with segments and timestamps
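The `srt` and `vtt` outputs can be derived from the segment timestamps in that JSON. A minimal SRT sketch, assuming Whisper's `verbose_json` segments (dicts with `start`, `end`, and `text`; the helper name is illustrative):

```python
def to_srt(segments: list[dict]) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SubRip subtitles."""
    def stamp(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

WebVTT is nearly identical: the file starts with a `WEBVTT` header and timestamps use `.` instead of `,` before the milliseconds.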
## Step-by-Step Workflow
1. Check for the file
Ask the user to upload the file or provide a local path. Check:
```bash
ls /mnt/user-data/uploads/
```
2. Check ffmpeg and install deps
```bash
which ffmpeg && ffmpeg -version 2>&1 | head -1
pip install openai pydub --break-system-packages -q 2>&1 | tail -3
```
3. Get the API key
If `OPENAI_API_KEY` is not set in the environment, ask the user:

"Please provide your OpenAI API key (it starts with `sk-`). You can get one at https://platform.openai.com/api-keys"
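A small sketch of that check in Python; the helper name is illustrative, and the message is the one quoted above:

```python
import os

def require_api_key() -> str:
    """Return the OpenAI API key, or fail with the message to relay to the user."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "Please provide your OpenAI API key (it starts with sk-). "
            "You can get one at https://platform.openai.com/api-keys"
        )
    return key
```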
4. Run the script
```bash
python /home/claude/transcription/scripts/transcribe.py \
  --input "<file_path>" \
  --output "/mnt/user-data/outputs/transcript.txt"
```
5. Post-process (optional but recommended)
After transcription, offer to:
- Clean up punctuation/formatting with Claude
- Summarize the content
- Extract action items, speakers, or key topics
- Translate to another language
Use the transcript text directly in the conversation for these steps.
## Handling Large Files
The script automatically splits files larger than 20 MB into overlapping chunks (1-second overlap for continuity) so each piece stays under the 25 MB API limit. Each chunk is transcribed separately and the results are merged.
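A minimal sketch of that splitting step with pydub, assuming the 20 MB threshold and 1-second overlap described above (the bitrate-based chunk length is an approximation; the real script may differ):

```python
import os
from pydub import AudioSegment

def split_with_overlap(path: str, max_mb: int = 20, overlap_ms: int = 1000) -> list[AudioSegment]:
    """Split an audio file into chunks of roughly max_mb each, overlapping by overlap_ms."""
    audio = AudioSegment.from_file(path)
    size = os.path.getsize(path)
    if size <= max_mb * 1024 * 1024:
        return [audio]
    # Estimate how many milliseconds fit in max_mb at this file's average bitrate.
    ms_per_chunk = max(int(len(audio) * max_mb * 1024 * 1024 / size), 2 * overlap_ms)
    chunks, start = [], 0
    while start < len(audio):
        chunks.append(audio[start : start + ms_per_chunk])
        start += ms_per_chunk - overlap_ms  # step back 1 s so consecutive chunks overlap
    return chunks
```

Each chunk is then exported to a temporary file (e.g. `chunk.export(tmp_path, format="mp3")`) and sent to the API separately.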
For very long recordings (> 1 hour), warn the user it may take a few minutes and show progress.
## Error Handling
| Error | Fix |
|---|---|
| Authentication error | Invalid API key; ask the user to verify it |
| Rate limit exceeded | Wait 60s and retry |
| File or chunk too large | Reduce the chunk size below 25 MB |
| Empty or garbled transcript | File may be corrupt or in the wrong format |
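For the rate-limit case in particular, a simple retry sketch using the `openai` client's exception type (the 60-second pause mirrors the guidance above):

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_with_retry(path: str, attempts: int = 3):
    """Retry the Whisper call after a 60 s pause if the API rate-limits us."""
    for attempt in range(attempts):
        try:
            with open(path, "rb") as f:
                return client.audio.transcriptions.create(model="whisper-1", file=f)
        except RateLimitError:
            if attempt == attempts - 1:
                raise
            time.sleep(60)  # matches the "wait 60s and retry" guidance above
```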
## Example Interaction
User: "Can you transcribe this meeting recording?" [uploads meeting.mp4] → Check file exists in /mnt/user-data/uploads/ → Run transcribe.py on it → Save transcript to /mnt/user-data/outputs/ → present_files() to the user → Offer to summarize or extract action items
## Notes for openclaw.ai
- Always save output to `/mnt/user-data/outputs/` so users can download it
- Use `present_files()` to share the transcript file with the user after saving
- For business users, suggest the `srt` or `vtt` format if they're adding captions to video
- The `--prompt` flag is useful for technical/domain-specific content: pass a few domain keywords to improve accuracy
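That context hint maps onto the Whisper API's `prompt` parameter. A minimal sketch, with an illustrative file name and keyword list:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical domain keywords; they bias Whisper toward the right spellings
# of jargon, product names, and acronyms.
domain_hint = "Kubernetes, Terraform, OpenTelemetry, canary deployment, SLO"

with open("meeting_audio.mp3", "rb") as f:  # illustrative file name
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        prompt=domain_hint,
    )
print(result.text)
```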