Awesome-omni-skill stt-tts-service
Lightweight local speech-to-text and text-to-speech service for OpenClaw
install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/stt-tts-service" ~/.claude/skills/diegosouzapw-awesome-omni-skill-stt-tts-service && rm -rf "$T"
manifest: skills/development/stt-tts-service/SKILL.md
STT-TTS Service
A lightweight, local speech-to-text (STT) and text-to-speech (TTS) service that runs on any device connected to your OpenClaw server. Perfect for voice-enabled workflows and flexible resource allocation.
Features
- Speech-to-Text: Transcribe audio using faster-whisper (up to 4x faster than the reference OpenAI Whisper implementation)
- Text-to-Speech: Generate natural speech using piper-tts, with pyttsx3 as a fallback
- 100% Local: No cloud APIs, works offline after initial model download
- Flexible Deployment: Run on any device - Raspberry Pi, laptop, or GPU server
- HTTP API: Simple REST endpoints for easy integration
Quick Start
Installation
# Clone or download this skill
cd stt-tts-service
# Install dependencies
pip install -r requirements.txt
# Start the service
python main.py
Docker Deployment
docker build -t stt-tts-service .
docker run -p 8765:8765 stt-tts-service
API Endpoints
POST /stt - Speech to Text
Transcribe audio files to text.
curl -X POST http://localhost:8765/stt \
  -F "audio=@recording.wav"
Response:
{
  "text": "Hello, this is the transcribed text.",
  "language": "en",
  "duration": 3.5
}
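For callers that prefer Python over curl, the same upload can be sketched with the standard library alone. This is a minimal sketch, not the project's own client: the helper names `build_multipart` and `transcribe` are hypothetical, the `audio/wav` part type is an assumption, and the form field name `audio` is taken from the curl example above.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "audio/wav") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost:8765") -> dict:
    """POST an audio file to /stt and return the decoded JSON response."""
    with open(path, "rb") as fh:
        body, header = build_multipart("audio", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        f"{base_url}/stt",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `transcribe("recording.wav")["text"]` should yield the transcribed string, assuming the response has the shape shown above.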
POST /tts - Text to Speech
Convert text to audio.
curl -X POST http://localhost:8765/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "default"}' \
  --output speech.wav
Parameters:
- text (required): Text to synthesize
- voice (optional): Voice ID to use
- speed (optional): Speech rate multiplier (0.5-2.0)
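The parameters above can be checked on the client before a request is sent. The following is a minimal sketch under stated assumptions: `build_tts_payload` and `synthesize` are hypothetical helper names, and the default `speed` of 1.0 is an assumption, since the source gives only the allowed range.

```python
import json
import urllib.request


def build_tts_payload(text: str, voice: str = "default", speed: float = 1.0) -> bytes:
    """Validate the documented parameters and return the JSON request body."""
    if not text.strip():
        raise ValueError("text is required")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    return json.dumps({"text": text, "voice": voice, "speed": speed}).encode("utf-8")


def synthesize(text: str, out_path: str,
               base_url: str = "http://localhost:8765", **options) -> None:
    """POST to /tts and write the returned audio bytes to out_path."""
    request = urllib.request.Request(
        f"{base_url}/tts",
        data=build_tts_payload(text, **options),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as fh:
        fh.write(response.read())
```

A call such as `synthesize("Hello world", "speech.wav", speed=1.25)` mirrors the curl example. Rejecting an out-of-range `speed` locally avoids a round trip, though the server's own validation remains authoritative.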
GET /health
Health check endpoint.
curl http://localhost:8765/health
GET /models
List available models and voices.
curl http://localhost:8765/models
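A script that starts the service and then calls it may want to wait until /health responds. The sketch below assumes that a healthy service answers with HTTP 200, which the source does not state explicitly. `http_ok` and `wait_until_healthy` are hypothetical names, and the `probe` parameter exists only so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to url answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(url: str = "http://localhost:8765/health",
                       attempts: int = 10, delay: float = 1.0,
                       probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll the health endpoint until it responds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before the next probe
    return False
```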
WebSocket Streaming (Real-time Voice)
For real-time voice conversations, use WebSocket endpoints:
WS /ws/stt - Streaming Speech-to-Text
Stream audio and receive transcriptions in real-time.
const ws = new WebSocket('ws://localhost:8765/ws/stt');

// Receive transcriptions
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log(data.text); // Transcribed text
};

// Wait for the connection to open before sending anything
ws.onopen = () => {
  // Send audio chunks (16kHz, 16-bit, mono PCM)
  ws.send(audioBuffer);

  // Flush remaining audio
  ws.send(JSON.stringify({action: "flush"}));
};
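The audio format expected by the streaming endpoint (16 kHz, 16-bit, mono PCM) can be produced from floating-point samples as sketched below. Little-endian byte order and the 100 ms chunk duration are assumptions, as the source specifies neither. `floats_to_pcm16` and `chunk_pcm` are hypothetical helper names.

```python
import math
import struct
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000   # Hz, per the format noted above
BYTES_PER_SAMPLE = 2   # 16-bit mono


def floats_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM (little-endian assumed)."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration chunks; the final chunk may be shorter."""
    size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# One second of a 440 Hz tone as demo input.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
chunks = list(chunk_pcm(floats_to_pcm16(tone)))
print(len(chunks), len(chunks[0]))  # prints "10 3200"
```

Each yielded chunk would then be sent as one binary WebSocket frame. Smaller chunks lower latency at the cost of more frames.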
WS /ws/tts - Streaming Text-to-Speech
Send text and receive audio chunks in real-time.
const ws = new WebSocket('ws://localhost:8765/ws/tts');

// Send text to synthesize once the connection is open
ws.onopen = () => {
  ws.send(JSON.stringify({text: "Hello world"}));
};

// Receive audio chunks
ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    // Audio chunk - play it
    playAudio(event.data);
  }
};
WS /ws/voice - Full Duplex Voice Conversation
Stream audio input and receive audio output for real-time voice-to-voice.
const ws = new WebSocket('ws://localhost:8765/ws/voice');

// Stream microphone audio
navigator.mediaDevices.getUserMedia({audio: true})
  .then(stream => {
    // Send audio chunks to WebSocket
  });

// Handle responses
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "transcript") {
    // User's speech transcribed - send to your AI
    sendToAI(data.text);
  }
};

// Send AI response to be spoken
ws.send(JSON.stringify({action: "speak", text: aiResponse}));
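The message handling in the voice loop can be separated from the transport, which makes it easy to test without a socket. The sketch below covers only text frames and relies on the two message shapes shown above: an incoming `{"type": "transcript", "text": ...}` and an outgoing `{"action": "speak", "text": ...}`. The function name `handle_voice_message` and the `reply_for` callback, which stands in for your AI, are hypothetical.

```python
import json
from typing import Callable, Optional


def handle_voice_message(raw: str, reply_for: Callable[[str], str]) -> Optional[str]:
    """Return a "speak" frame for a transcript message, or None to ignore it."""
    data = json.loads(raw)
    if data.get("type") != "transcript":
        return None  # not a transcript; nothing to say
    text = data.get("text", "").strip()
    if not text:
        return None  # empty transcript, e.g. silence
    return json.dumps({"action": "speak", "text": reply_for(text)})
```

Whatever WebSocket client is used, each incoming text frame would be passed to `handle_voice_message`, and any non-None result sent back over the same connection.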
Configuration
Set environment variables or edit config.py:
| Variable | Default | Description |
|---|---|---|
| | | Whisper model: tiny, base, small, medium |
| | | TTS engine: piper, pyttsx3, auto |
| | | Compute device: cpu, cuda, auto |
| | | Server bind address |
| | | Server port |
Model Sizes
| STT Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | ~75MB | Fastest | Basic |
| base | ~150MB | Fast | Good |
| small | ~500MB | Medium | Better |
| medium | ~1.5GB | Slower | Best |
OpenClaw Integration
Register this service with your OpenClaw server:
openclaw service register http://device-ip:8765
Then use in your workflows:
- action: stt
  input: ${audio_file}
  output: transcription
- action: tts
  input: "Hello, ${user_name}!"
  output: greeting_audio
Requirements
- Python 3.9+
- 2GB RAM minimum (4GB recommended for medium model)
- ~500MB disk space (plus model storage)