# Trending-skills: parlor-on-device-ai
An on-device, real-time multimodal voice and vision AI assistant powered by Gemma 4 E2B and Kokoro TTS, running entirely locally behind a FastAPI WebSocket server.
```bash
git clone https://github.com/Aradotso/trending-skills
```

To install only this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/parlor-on-device-ai" ~/.claude/skills/aradotso-trending-skills-parlor-on-device-ai && rm -rf "$T"
```

Source: `skills/parlor-on-device-ai/SKILL.md`

# Parlor On-Device AI
Skill by ara.so — Daily 2026 Skills collection.
Parlor is a real-time, on-device multimodal AI assistant. It combines Gemma 4 E2B (via LiteRT-LM) for speech and vision understanding with Kokoro TTS for voice output. Everything runs locally — no API keys, no cloud calls, no cost per request.
## Architecture
```
Browser (mic + camera)
    │
    │ WebSocket (audio PCM + JPEG frames)
    ▼
FastAPI server
    ├── Gemma 4 E2B via LiteRT-LM (GPU)        → understands speech + vision
    └── Kokoro TTS (MLX on Mac, ONNX on Linux) → speaks back
    │
    │ WebSocket (streamed audio chunks)
    ▼
Browser (playback + transcript)
```
Key features:
- Silero VAD in browser — hands-free, no push-to-talk
- Barge-in — interrupt the AI mid-sentence by speaking (see the sketch after this list)
- Sentence-level TTS streaming — audio starts before full response is ready
- Platform-aware TTS — MLX backend on Apple Silicon, ONNX on Linux
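A minimal sketch of how barge-in can work server-side with asyncio task cancellation. The names here (`tts_task`, `speak`) are illustrative, not taken from the project; check `server.py` for the actual handling. It reuses the `synthesize_streaming` generator shown later in this document:

```python
import asyncio

tts_task: asyncio.Task | None = None

async def speak(websocket, text: str):
    # Run streaming TTS as a task so it can be cancelled mid-sentence
    global tts_task
    async def _stream():
        async for chunk in synthesize_streaming(text):
            await websocket.send_bytes(chunk)
    tts_task = asyncio.create_task(_stream())

async def on_speech_start():
    # Barge-in: the user started talking, so stop any in-flight TTS
    global tts_task
    if tts_task is not None and not tts_task.done():
        tts_task.cancel()
    tts_task = None
```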
## Requirements
- Python 3.12+
- macOS with Apple Silicon or Linux with a supported GPU
- ~3 GB free RAM
- `uv` package manager
## Installation
```bash
git clone https://github.com/fikrikarim/parlor.git
cd parlor

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

cd src
uv sync
uv run server.py
```
Open http://localhost:8000, grant camera and microphone permissions, and start talking.
Models download automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).
## Configuration
Set environment variables before running:
```bash
# Use a pre-downloaded model instead of auto-downloading
export MODEL_PATH=/path/to/gemma-4-E2B-it.litertlm

# Change server port (default: 8000)
export PORT=9000

uv run server.py
```
| Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | auto-download from HuggingFace | Path to local model file |
| `PORT` | `8000` | Server port |
## Project Structure
```
src/
├── server.py            # FastAPI WebSocket server + Gemma 4 inference
├── tts.py               # Platform-aware TTS (MLX on Mac, ONNX on Linux)
├── index.html           # Frontend UI (VAD, camera, audio playback)
├── pyproject.toml       # Dependencies
└── benchmarks/
    ├── bench.py         # End-to-end WebSocket benchmark
    └── benchmark_tts.py # TTS backend comparison
```
## Key Components

### server.py — FastAPI WebSocket Server
The server handles two WebSocket data flows: receiving audio/video from the browser and streaming synthesized audio back. The simplified pattern below collapses both onto a single endpoint.
```python
# Simplified pattern from server.py
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    async for data in websocket.iter_bytes():
        # data contains PCM audio + optional JPEG frame
        response_text = await run_gemma_inference(data)
        audio_chunks = await run_tts(response_text)
        for chunk in audio_chunks:
            await websocket.send_bytes(chunk)
```
### tts.py — Platform-Aware TTS
Kokoro TTS selects its backend based on the platform:
```python
# tts.py uses platform detection
import platform

def get_tts_backend():
    if platform.system() == "Darwin":
        # Apple Silicon: use MLX backend for GPU acceleration
        from kokoro_mlx import KokoroMLX
        return KokoroMLX()
    else:
        # Linux: use ONNX backend
        from kokoro import KokoroPipeline
        return KokoroPipeline(lang_code='a')

tts = get_tts_backend()

# Sentence-level streaming — yields audio as each sentence is ready
async def synthesize_streaming(text: str):
    for sentence in split_sentences(text):
        audio = tts.synthesize(sentence)
        yield audio
```
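The `split_sentences` helper is assumed above but not shown. A minimal regex-based stand-in (an illustration, not necessarily the project's actual implementation) could look like this:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive split on sentence-ending punctuation followed by whitespace.
    # Sufficient for streaming TTS; a production version would also
    # handle abbreviations, decimals, and trailing fragments.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```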
### Gemma 4 E2B Inference via LiteRT-LM
```python
# LiteRT-LM inference pattern
from litert_lm import LiteRTLM
import os

model_path = os.environ.get("MODEL_PATH", None)

# Auto-downloads if MODEL_PATH not set
model = LiteRTLM.from_pretrained(
    "google/gemma-4-E2B-it",
    local_path=model_path
)

async def run_gemma_inference(audio_pcm: bytes, image_jpeg: bytes = None):
    inputs = {"audio": audio_pcm}
    if image_jpeg:
        inputs["image"] = image_jpeg
    response = ""
    async for token in model.generate_stream(**inputs):
        response += token
    return response
```
## Running Benchmarks
```bash
cd src

# End-to-end WebSocket latency benchmark
uv run benchmarks/bench.py

# Compare TTS backends (MLX vs ONNX)
uv run benchmarks/benchmark_tts.py
```
## Performance Reference (Apple M3 Pro)
| Stage | Time |
|---|---|
| Speech + vision understanding | ~1.8–2.2s |
| Response generation (~25 tokens) | ~0.3s |
| Text-to-speech (1–3 sentences) | ~0.3–0.7s |
| Total end-to-end | ~2.5–3.0s |
Decode speed: ~83 tokens/sec on GPU.
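To reproduce that number yourself, a rough timing harness built on the `generate_stream` pattern from the inference section above (a sketch, assuming that API) might be:

```python
import time

async def measure_decode_speed(model, audio_pcm: bytes) -> float:
    # Streams a full response and reports tokens per second.
    # Includes prefill time, so it slightly understates raw decode speed.
    start = time.perf_counter()
    tokens = 0
    async for _ in model.generate_stream(audio=audio_pcm):
        tokens += 1
    elapsed = time.perf_counter() - start
    return tokens / elapsed if elapsed > 0 else 0.0
```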
## Common Patterns

### Extending the System Prompt
Modify the prompt in `server.py` to change the AI's persona or task:
```python
SYSTEM_PROMPT = """You are a helpful language tutor.
Respond conversationally in 1-3 sentences.
If the user makes a grammar mistake, gently correct them.
You can see through the user's camera and discuss what you observe."""
```
### Adding a New Language for TTS
Kokoro supports multiple language codes. Set `lang_code` in `tts.py`:
```python
# Language codes: 'a' = American English, 'b' = British English
# 'e' = Spanish, 'f' = French, 'z' = Chinese, 'j' = Japanese
pipeline = KokoroPipeline(lang_code='e')  # Spanish
```
### Customizing VAD Sensitivity (index.html)
The Silero VAD threshold can be tuned in the frontend:
```javascript
// In index.html — lower positiveSpeechThreshold = more sensitive
const vad = await MicVAD.new({
  positiveSpeechThreshold: 0.6,   // default ~0.8, lower = triggers more easily
  negativeSpeechThreshold: 0.35,  // how quickly it stops detecting speech
  minSpeechFrames: 3,
  onSpeechStart: () => { /* UI feedback */ },
  onSpeechEnd: (audio) => sendAudioToServer(audio),
});
```
### Sending Frames Programmatically (WebSocket Client Example)
```python
import asyncio
import websockets
import json
import base64

async def send_audio_frame(audio_pcm_bytes: bytes, jpeg_bytes: bytes = None):
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:
        payload = {
            "audio": base64.b64encode(audio_pcm_bytes).decode(),
        }
        if jpeg_bytes:
            payload["image"] = base64.b64encode(jpeg_bytes).decode()
        await ws.send(json.dumps(payload))

        # Receive streamed audio response
        async for message in ws:
            audio_chunk = message  # raw PCM bytes
            # play or save audio_chunk
```
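To drive that client from a file instead of a live microphone, you can pull raw PCM out of a WAV. This sketch assumes the server expects mono 16-bit PCM, matching typical browser capture (verify the exact format against `server.py`):

```python
import asyncio
import wave

async def main():
    # Extract raw PCM frames from a mono 16-bit WAV file
    with wave.open("question.wav", "rb") as wf:
        pcm = wf.readframes(wf.getnframes())
    await send_audio_frame(pcm)

asyncio.run(main())
```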
## Troubleshooting

### Model download fails
```bash
# Pre-download manually via huggingface_hub
uv run python -c "
from huggingface_hub import hf_hub_download
path = hf_hub_download('google/gemma-4-E2B-it', 'gemma-4-E2B-it.litertlm')
print(path)
"
export MODEL_PATH=/path/shown/above
uv run server.py
```
### Microphone/camera not working in browser
- Must access via `http://localhost` (not an IP address) — browsers block media APIs on non-localhost HTTP
- Check browser permissions: address bar → lock icon → reset permissions
### TTS not loading on Linux
```bash
# Ensure ONNX runtime is installed
uv add onnxruntime

# Or for GPU:
uv add onnxruntime-gpu
```
### High latency or slow inference
- Verify GPU is being used: check for Metal (Mac) or CUDA (Linux) in startup logs
- Close other GPU-heavy applications
- On Linux, confirm your CUDA drivers match the installed `onnxruntime-gpu` version (see the check below)
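On Linux you can confirm ONNX Runtime actually sees the GPU by listing its execution providers (standard `onnxruntime` API):

```python
import onnxruntime as ort

# 'CUDAExecutionProvider' should appear in this list when
# onnxruntime-gpu and the CUDA drivers are set up correctly
print(ort.get_available_providers())
```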
### Port already in use
```bash
export PORT=8080
uv run server.py

# Or kill the existing process:
lsof -ti:8000 | xargs kill
```
### `uv sync` fails — Python version mismatch

```bash
# Parlor requires Python 3.12+
python3 --version

# Install 3.12 via pyenv or system package manager, then:
uv python pin 3.12
uv sync
```
## Dependencies (pyproject.toml)

Key packages installed by `uv sync`:
- `litert-lm` — Google AI Edge inference runtime for Gemma
- `fastapi` + `uvicorn` — async web/WebSocket server
- `kokoro` — Kokoro TTS ONNX backend
- `kokoro-mlx` — Kokoro TTS MLX backend (Mac only)
- `silero-vad` — voice activity detection (browser-side via CDN)
- `huggingface-hub` — model auto-download