voice-stt-tts — full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS.

Install the skill from the repository:

```bash
git clone https://github.com/openclaw/skills
```

Or copy it straight into your skills directory in one step (pick the path your setup uses):

```bash
# Into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/aksenkin/voice-stt-tts" ~/.claude/skills/openclaw-skills-voice-stt-tts && rm -rf "$T"

# Into ~/.openclaw/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/aksenkin/voice-stt-tts" ~/.openclaw/skills/openclaw-skills-voice-stt-tts && rm -rf "$T"
```
# Voice Messages (STT + TTS) for OpenClaw 🎙️
Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.
## What we configure
- ✅ STT (Speech-to-Text) — transcribe voice messages via faster-whisper
- ✅ TTS (Text-to-Speech) — voice replies via Edge TTS
- 🎯 Result: voice → text → reply with voice
## Installation

### 1. Create a virtual environment (venv)

On Ubuntu, create an isolated venv:

```bash
python3 -m venv ~/.openclaw/workspace/voice-messages
```

### 2. Install faster-whisper

Install the package into the venv:

```bash
~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper
```
What gets installed:

- `faster-whisper` — the Python transcription library
- Dependencies: `ctranslate2`, `onnxruntime`, `huggingface-hub`, `av`, `numpy`, and others
- Size: ~250 MB
## Transcription Script

File: `~/.openclaw/workspace/voice-messages/transcribe.py`

```python
#!/usr/bin/env python3
import argparse

from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()
    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()
```
What the script does:

- Accepts an audio file path (`--audio`)
- Loads a Whisper model (`--model`): `small` by default
- Sets the language (`--lang`): `en` for English
- Transcribes with a VAD filter (Voice Activity Detection)
- Outputs clean text to stdout
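The segment-joining step in the script can be exercised on its own. A minimal sketch, using stub segment objects in place of faster-whisper's real output (the `Segment` class below is a stand-in, not the library's type):

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Stand-in for faster-whisper's segment objects (only .text is used here)."""
    text: str


def join_segments(segments) -> str:
    # Same joining logic as transcribe(): strip each piece, drop empty segments
    return " ".join(s.text.strip() for s in segments if s.text and s.text.strip()).strip()


segs = [Segment(" Hello. "), Segment("  "), Segment(" How are you? ")]
print(join_segments(segs))  # → Hello. How are you?
```

The whitespace-only segment is dropped, so the output is a single clean line suitable for stdout.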
Make the file executable:

```bash
chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
```
## OpenClaw Configuration

### 1. Configure STT (`tools.media.audio`)

Add to `~/.openclaw/openclaw.json`:

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio", "{{MediaPath}}",
              "--lang", "en",
              "--model", "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
```
Parameters:

| Parameter | Value | Description |
|---|---|---|
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to the venv's python |
| `args` | argument array | Arguments for the script |
| `{{MediaPath}}` | placeholder | Replaced with the audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |
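Before restarting the gateway later, it can be worth checking that the fragment parses as valid JSON and that the sizes are what you expect. A quick stand-alone check (the string literal below mirrors the config above):

```python
import json

# The STT fragment from above, embedded as a string
stt_fragment = """
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio", "{{MediaPath}}",
              "--lang", "en",
              "--model", "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
"""

audio = json.loads(stt_fragment)["tools"]["media"]["audio"]
assert audio["maxBytes"] == 20 * 1024 * 1024  # 20 MB exactly
print(audio["models"][0]["timeoutSeconds"])  # → 120
```

The same `json.loads` check catches typos like trailing commas before they reach the gateway.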
### 2. Configure TTS (`messages.tts`)

Add to `~/.openclaw/openclaw.json`:

```json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}
```
Parameters:

| Parameter | Value | Description |
|---|---|---|
| `auto` | `"inbound"` | Key mode! Reply with voice only to incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see the list below) |
| `lang` | `"en-US"` | Locale (`en-US` for US English) |
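Both fragments go into the same `~/.openclaw/openclaw.json`, so if the file already has other keys you will want to merge rather than overwrite. A minimal recursive-merge sketch (plain Python, not an OpenClaw API):

```python
import json


def deep_merge(base: dict, extra: dict) -> dict:
    """Recursively merge `extra` into a copy of `base`; `extra` wins on conflicts."""
    out = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out


existing = {"messages": {"ackReactionScope": "group-mentions"}}
tts_fragment = {"messages": {"tts": {"auto": "inbound", "provider": "edge",
                                     "edge": {"voice": "en-US-JennyNeural", "lang": "en-US"}}}}

merged = deep_merge(existing, tts_fragment)
# Existing keys survive alongside the new tts block
print(merged["messages"]["ackReactionScope"])  # → group-mentions
print(merged["messages"]["tts"]["provider"])   # → edge
```

This is the same shape as the full configuration example below: sibling keys under `messages` coexist instead of clobbering each other.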
### 3. Full configuration example

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio", "{{MediaPath}}",
              "--lang", "en",
              "--model", "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
```
## Apply Changes

Restart the Gateway:

```bash
# Method 1: via the openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)
```
## Testing

### Test STT (transcription)

Action: send a voice message to your Telegram bot.

Expected result:

```
[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>
```

Example response:

```
[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?
```
### Test TTS (voice replies)

Action: after a successful transcription, the bot should send a voice reply.

Expected result:

- A voice file arrives in Telegram
- As a voice note (round bubble)

Expected behavior:

- Incoming voice → the bot replies with voice
- Text messages → the bot replies with text (this is normal!)
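The routing rule above can be summarized in a few lines. This is a sketch of the documented behavior, not OpenClaw's actual implementation:

```python
def reply_mode(incoming_is_voice: bool, tts_auto: str = "inbound") -> str:
    """With auto='inbound', only incoming voice messages get voice replies."""
    if tts_auto == "inbound" and incoming_is_voice:
        return "voice"
    return "text"


print(reply_mode(incoming_is_voice=True))   # voice in → voice reply
print(reply_mode(incoming_is_voice=False))  # text in  → text reply
```

So a text message getting a text reply is not a bug; it falls out of the `inbound` rule directly.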
## Available Edge TTS Voices

### Female voices

| Voice | ID | Notes |
|---|---|---|
| Jenny | `en-US-JennyNeural` | ← current |
| Ana | `en-US-AnaNeural` | Softer |

### Male voices

| Voice | ID | Notes |
|---|---|---|
| Dmitry | `ru-RU-DmitryNeural` | More bass |
How to change the voice:

```bash
cat ~/.openclaw/openclaw.json | \
  jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
```
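If `jq` is not installed, the same edit can be made with a short Python one-off using the same write-temp-then-rename pattern. A generic sketch, not an OpenClaw tool (the demo below deliberately uses a throwaway file, not your real config):

```python
import json
import os
import tempfile


def set_voice(path: str, voice: str) -> None:
    with open(path) as f:
        cfg = json.load(f)
    cfg.setdefault("messages", {}).setdefault("tts", {}).setdefault("edge", {})["voice"] = voice
    # Write to a temp file in the same directory, then atomically replace the original
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(cfg, f, indent=2)
    os.replace(tmp, path)


# Demo on a throwaway file (not your real openclaw.json)
demo = os.path.join(tempfile.gettempdir(), "openclaw-demo.json")
with open(demo, "w") as f:
    json.dump({"messages": {"tts": {"edge": {"voice": "en-US-JennyNeural"}}}}, f)

set_voice(demo, "en-US-MichelleNeural")
with open(demo) as f:
    print(json.load(f)["messages"]["tts"]["edge"]["voice"])  # → en-US-MichelleNeural
```

The temp-then-rename step matters: it avoids truncating the config if the process dies mid-write.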
## Additional Edge TTS Parameters

Adjusting speed, pitch, and volume (JSON does not allow comments, so the valid ranges are listed below the example):

```json
{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",
        "pitch": "-5%",
        "volume": "+5%"
      }
    }
  }
}
```

Ranges: `rate` -50% to +100%, `pitch` -50% to +50%, `volume` -100% to +100%.
## Troubleshooting

### Problem: voice not transcribed

Logs show:

```
[ERROR] Transcription failed
```

Possible causes:

1. **File too large** (> 20 MB)

   ```
   # Solution: increase maxBytes in the config
   maxBytes: 52428800  # 50 MB
   ```

2. **Timeout** — transcription took > 2 minutes

   ```
   # Solution: increase timeoutSeconds
   timeoutSeconds: 180  # 3 minutes
   ```

3. **Model not downloaded** — first run

   ```
   # Solution: wait while it downloads (1-2 minutes)
   # Models are cached in ~/.cache/huggingface/
   ```
### Problem: no voice reply

Possible causes:

1. **Reply too short** (< 10 characters) — TTS skips very short replies; this is expected behavior.

2. **`"auto": "inbound"` but the message was text** — in `inbound` mode, TTS replies with voice only to voice messages; text messages get text replies. This is correct!

3. **Edge TTS unavailable**

   ```bash
   # Check
   curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
   # If it errors, the service is temporarily unavailable
   ```
## Performance

Transcription time (Raspberry Pi 4 / ARM):

| Whisper model | Est. time | Quality |
|---|---|---|
| `tiny` | ~5-10 sec | Low |
| `base` | ~10-20 sec | Medium |
| `small` | ~20-40 sec | High ← current |
| `medium` | ~40-80 sec | Very high |
| `large` | ~80-160 sec | Maximum |

Recommendation: on a Raspberry Pi use `small` or `base`; `medium`/`large` will be very slow.
### Where Whisper models are stored

```
~/.cache/huggingface/
```

Models download automatically on first run.
## Done! 🎉

After completing these steps:

- ✅ faster-whisper installed in a venv
- ✅ `transcribe.py` script created
- ✅ OpenClaw configured (STT + TTS)
- ✅ Gateway restarted
- ✅ Voice messages working

Now your Telegram bot:

- 🎙️ Accepts voice → transcribes via faster-whisper
- 🎤 Replies with voice → generated via Edge TTS
- 💬 Accepts text → replies with text (as usual)
Useful links:

- OpenClaw docs: https://docs.openclaw.ai
- TTS docs: https://docs.openclaw.ai/tts
- Audio docs: https://docs.openclaw.ai/nodes/audio
- Find skills: `npx clawhub search voice`
Created: 2026-03-01 for OpenClaw 2026.2.26