Skills voice-stt-tts

Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS

install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/aksenkin/voice-stt-tts" ~/.claude/skills/openclaw-skills-voice-stt-tts && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/aksenkin/voice-stt-tts" ~/.openclaw/skills/openclaw-skills-voice-stt-tts && rm -rf "$T"
manifest: skills/aksenkin/voice-stt-tts/SKILL.md
source content

Voice Messages (STT + TTS) for OpenClaw 🎙️

Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.

What we configure

  • STT (Speech-to-Text) — transcribe voice messages via faster-whisper
  • TTS (Text-to-Speech) — voice replies via Edge TTS
  • 🎯 Result: voice → text → reply with voice

Installation

1. Create virtual environment (venv)

On Ubuntu, create an isolated venv:

python3 -m venv ~/.openclaw/workspace/voice-messages

2. Install faster-whisper

Install packages in venv:

~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper

What gets installed:

  • faster-whisper — Python library for transcription
  • Dependencies: ctranslate2, onnxruntime, huggingface-hub, av, numpy, and others
  • Size: ~250 MB

Transcription Script

Path and content

File:

~/.openclaw/workspace/voice-messages/transcribe.py

#!/usr/bin/env python3
import argparse
from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()

    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text)


if __name__ == "__main__":
    main()

What the script does:

  1. Accepts an audio file path (--audio)
  2. Loads the Whisper model (--model): small by default
  3. Sets the language (--lang): en for English
  4. Transcribes with a VAD filter (Voice Activity Detection)
  5. Outputs clean text to stdout
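Step 5 can be seen in isolation. The snippet below runs the same joining expression used in transcribe.py against stand-in segment objects (real segments come from faster-whisper's model.transcribe(); SimpleNamespace is only a stand-in):

```python
# The text-joining step from transcribe.py, demonstrated with stand-in
# segment objects instead of real faster-whisper output.
from types import SimpleNamespace

segments = [
    SimpleNamespace(text=" Hello."),
    SimpleNamespace(text="  How are you? "),
    SimpleNamespace(text="   "),  # whitespace-only segments are dropped
]

# Strip each segment, skip empty ones, join with single spaces.
text = " ".join(s.text.strip() for s in segments if s.text and s.text.strip()).strip()
print(text)  # Hello. How are you?
```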

Make file executable:

chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py

OpenClaw Configuration

1. Configure STT (tools.media.audio)

Add to ~/.openclaw/openclaw.json:

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}

Parameters:

| Parameter | Value | Description |
|---|---|---|
| enabled | true | Enable audio transcription |
| maxBytes | 20971520 | Max file size (20 MB) |
| type | "cli" | Model type: CLI command |
| command | Python path | Path to the python binary in the venv |
| args | argument array | Arguments for the script |
| {{MediaPath}} | placeholder | Replaced with the audio file path |
| timeoutSeconds | 120 | Transcription timeout (2 minutes) |
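The {{MediaPath}} placeholder behaves like a simple template substitution over the argument list. A minimal sketch of the idea (illustrative only, not OpenClaw's actual implementation; the media path is made up):

```python
# Illustrative expansion of the {{MediaPath}} placeholder into the CLI
# argument list before the transcriber is invoked (a sketch, not OpenClaw code).
args_template = [
    "~/.openclaw/workspace/voice-messages/transcribe.py",
    "--audio", "{{MediaPath}}",
    "--lang", "en",
    "--model", "small",
]

media_path = "/tmp/voice-note.ogg"  # hypothetical incoming audio file
args = [a.replace("{{MediaPath}}", media_path) for a in args_template]
print(args[2])  # /tmp/voice-note.ogg
```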

2. Configure TTS (messages.tts)

Add to ~/.openclaw/openclaw.json:

{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}

Parameters:

| Parameter | Value | Description |
|---|---|---|
| auto | "inbound" | Key mode: reply with voice only to incoming voice messages |
| provider | "edge" | TTS provider (free, no API key) |
| voice | "en-US-JennyNeural" | Voice (see available voices below) |
| lang | "en-US" | Locale (en-US for US English) |

3. Full configuration example

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    },
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}

Apply Changes

Restart Gateway

# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)

Testing

Test STT (transcription)

Action: Send a voice message to your Telegram bot

Expected result:

[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>

Example response:

[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?

Test TTS (voice replies)

Action: After successful transcription, bot should send a voice reply

Expected result:

  • Voice file arrives in Telegram
  • Voice note (round bubble)

Expected behavior:

  • Incoming voice → bot replies with voice
  • Text messages → bot replies with text (this is normal!)

Available Edge TTS Voices

Female voices

| Voice | ID | Notes |
|---|---|---|
| Jenny | en-US-JennyNeural | ← current |
| Ana | en-US-AnaNeural | Softer |

Male voices

| Voice | ID | Notes |
|---|---|---|
| Roger | en-US-RogerNeural | Deeper |

How to change voice:

jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' ~/.openclaw/openclaw.json \
  > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway

Additional Edge TTS Parameters

Adjusting speed, pitch, volume

{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",
        "pitch": "-5%",
        "volume": "+5%"
      }
    }
  }
}

Ranges: rate from -50% to +100%, pitch from -50% to +50%, volume from -100% to +100%. (Strict JSON does not allow comments, so the ranges are noted here rather than inline.)
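These values are signed percentage strings. A quick format-and-range check for such strings (an assumption-based sketch; the exact grammar Edge TTS accepts is not verified here, and the ranges are the ones this document states):

```python
# Validate the signed-percentage format used by rate/pitch/volume above.
# The accepted ranges are taken from this document; treat them as assumptions.
import re

def valid_percent(value: str, lo: int, hi: int) -> bool:
    # Require an explicit sign, digits, and a trailing percent sign.
    m = re.fullmatch(r"([+-]\d+)%", value)
    return m is not None and lo <= int(m.group(1)) <= hi

print(valid_percent("+10%", -50, 100))   # True  (rate)
print(valid_percent("-5%", -50, 50))     # True  (pitch)
print(valid_percent("150%", -100, 100))  # False (missing sign)
```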

Troubleshooting

Problem: Voice not transcribed

Logs show:

[ERROR] Transcription failed

Possible causes:

  1. File too large — > 20 MB

    # Solution: increase maxBytes in openclaw.json
    "maxBytes": 52428800  # 50 MB

  2. Timeout — transcription took > 2 minutes

    # Solution: increase timeoutSeconds
    "timeoutSeconds": 180  # 3 minutes

  3. Model not downloaded — first run

    # Solution: wait for the download to finish (1-2 minutes)
    # Models are cached in ~/.cache/huggingface/

Problem: No voice reply

Possible causes:

  1. Reply too short (< 10 characters)

    • TTS skips very short replies
    • Solution: this is expected behavior
  2. auto: "inbound" but text message

    • TTS in
      inbound
      mode replies with voice only on voice messages
    • Text messages get text replies — this is correct!
  3. Edge TTS unavailable

    # Check
    curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
    # If error — temporarily unavailable
    

Performance

Transcription time (Raspberry Pi 4/ARM)

| Whisper model | Est. time | Quality |
|---|---|---|
| tiny | ~5-10 sec | Low |
| base | ~10-20 sec | Medium |
| small | ~20-40 sec | High ← current |
| medium | ~40-80 sec | Very high |
| large | ~80-160 sec | Maximum |

Recommendation: on a Raspberry Pi use small or base; medium and large will be very slow.
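Using the table's worst-case estimates, model choice for a given latency budget can be sketched as follows (the numbers are this table's Raspberry Pi 4 estimates, not benchmarks; pick_model() is a made-up helper):

```python
# Pick the largest Whisper model whose worst-case transcription time
# (per the table above, Raspberry Pi 4 estimates) fits a latency budget.
EST_MAX_SEC = {"tiny": 10, "base": 20, "small": 40, "medium": 80, "large": 160}

def pick_model(budget_sec: int) -> str:
    best = "tiny"  # always fall back to the fastest model
    for name, worst in EST_MAX_SEC.items():  # ordered smallest to largest
        if worst <= budget_sec:
            best = name
    return best

print(pick_model(45))   # small
print(pick_model(200))  # large
```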

Where Whisper models are stored

~/.cache/huggingface/

Models download automatically on first run.

Done! 🎉

After completing these steps:

  1. ✅ faster-whisper installed in venv
  2. ✅ transcribe.py script created
  3. ✅ OpenClaw configured (STT + TTS)
  4. ✅ Gateway restarted
  5. ✅ Voice messages working

Now your Telegram bot:

  • 🎙️ Accepts voice → transcribes via faster-whisper
  • 🎤 Replies with voice → generates via Edge TTS
  • 💬 Accepts text → replies with text (as usual)


Created: 2026-03-01 for OpenClaw 2026.2.26