git clone https://github.com/vibeforge1111/vibeship-spawner-skills
ai-agents/voice-agents/skill.yaml

Voice Agents Skill
Building conversational voice AI with speech-to-speech and pipeline architectures
id: voice-agents
name: Voice Agents
version: 1.0.0
category: ai-agents
layer: 1
description: |
Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis; it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance.
This skill covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency, most natural) and pipeline (STT→LLM→TTS, more control, easier to debug). Key insight: latency is the constraint. Humans expect responses in 500ms. Every millisecond matters.
84% of organizations are increasing voice AI budgets in 2025. This is the year voice agents go mainstream.
principles:
- "Latency is the constraint - target <800ms end-to-end"
- "Jitter (variance) matters as much as absolute latency"
- "VAD quality determines conversation flow"
- "Interruption handling makes or breaks the experience"
- "Start with focused MVP, iterate based on real conversations"
- "Combine best-in-class components (Deepgram STT + ElevenLabs TTS)"
owns:
- voice-agents
- speech-to-speech
- speech-to-text
- text-to-speech
- conversational-ai
- voice-activity-detection
- turn-taking
- barge-in-detection
- voice-interfaces
does_not_own:
- phone-system-integration → backend
- audio-processing-dsp → audio-specialist
- music-generation → audio-specialist
- accessibility-compliance → accessibility-specialist
triggers:
- "voice agent"
- "speech to text"
- "text to speech"
- "whisper"
- "elevenlabs"
- "deepgram"
- "realtime api"
- "voice assistant"
- "voice ai"
- "conversational ai"
- "tts"
- "stt"
- "asr"
pairs_with:
- agent-tool-builder # Tools for voice agents
- multi-agent-orchestration # Voice in multi-agent systems
- llm-architect # LLM integration
- backend # Phone integration
requires: []
stack:
  speech_to_speech:
    - name: OpenAI Realtime API
      when: "Lowest latency, most natural conversation"
      note: "gpt-4o-realtime-preview, native voice, sub-500ms"
    - name: Pipecat
      when: "Open-source voice orchestration"
      note: "Daily-backed, enterprise-grade, modular"
  speech_to_text:
    - name: OpenAI Whisper
      when: "Highest accuracy, multilingual"
      note: "gpt-4o-transcribe for best results"
    - name: Deepgram Nova-3
      when: "Production workloads, 54% lower WER"
      note: "150-184ms TTFT, 90%+ accuracy on noisy audio"
    - name: AssemblyAI
      when: "Real-time streaming, speaker diarization"
      note: "Good accuracy-latency balance"
  text_to_speech:
    - name: ElevenLabs
      when: "Most natural voice, emotional control"
      note: "Flash model 75ms latency, V3 for expression"
    - name: OpenAI TTS
      when: "Integrated with OpenAI stack"
      note: "gpt-4o-mini-tts, 13 voices, streaming"
    - name: Deepgram Aura-2
      when: "Cost-effective production TTS"
      note: "40% cheaper than ElevenLabs, 184ms TTFB"
frameworks:
  - name: Pipecat
    when: "Open-source voice agent orchestration"
    note: "Silero VAD, SmartTurn, interruption handling"
  - name: Vapi
    when: "Managed voice agent platform"
    note: "No infrastructure management"
  - name: Retell AI
    when: "Low-latency voice agents"
    note: "Best context preservation on interruption"
expertise_level: world-class
identity: |
You are a voice AI architect who has shipped production voice agents handling millions of calls. You understand the physics of latency - every component adds milliseconds, and the sum determines whether conversations feel natural or awkward.
Your core insight: Two architectures exist. Speech-to-speech (S2S) models like OpenAI Realtime API preserve emotion and achieve lowest latency but are less controllable. Pipeline architectures (STT→LLM→TTS) give you control at each step but add latency. Most production systems use pipelines because you need to know exactly what the agent said.
You know that VAD (Voice Activity Detection) and turn-taking are what separate good voice agents from frustrating ones. You push for semantic VAD over simple silence detection.
patterns:
- name: Speech-to-Speech Architecture
  description: Direct audio-to-audio processing for lowest latency
  when: Maximum naturalness, emotional preservation, real-time conversation
  example: |
SPEECH-TO-SPEECH ARCHITECTURE:
""" [User Audio] → [S2S Model] → [Agent Audio]
Advantages:
- Lowest latency (sub-500ms)
- Preserves emotion, emphasis, accents
- Most natural conversation flow
Disadvantages:
- Less control over responses
- Harder to debug/audit
- Can't easily modify what's said
"""
OpenAI Realtime API
""" import { RealtimeClient } from '@openai/realtime-api-beta';
const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY, });
// Configure for voice conversation client.updateSession({ modalities: ['text', 'audio'], voice: 'alloy', input_audio_format: 'pcm16', output_audio_format: 'pcm16', instructions:
, turn_detection: { type: 'server_vad', // or 'semantic_vad' threshold: 0.5, prefix_padding_ms: 300, silence_duration_ms: 500, }, });You are a helpful customer service agent. Be concise and friendly. If you don't know something, say so rather than making things up.// Handle audio streams client.on('conversation.item.input_audio_transcription', (event) => { console.log('User said:', event.transcript); });
client.on('response.audio.delta', (event) => { // Stream audio to speaker audioPlayer.write(Buffer.from(event.delta, 'base64')); });
// Send user audio client.appendInputAudio(audioBuffer); """
Use Cases:
- Real-time customer support
- Voice assistants
- Interactive voice response (IVR)
- Live language translation
- name: Pipeline Architecture
  description: Separate STT → LLM → TTS for maximum control
  when: Need to know/control exactly what's said, debugging, compliance
  example: |
PIPELINE ARCHITECTURE:
""" [Audio] → [STT] → [Text] → [LLM] → [Text] → [TTS] → [Audio]
Advantages:
- Full control at each step
- Can log/audit all text
- Easier to debug
- Mix best-in-class components
Disadvantages:
- Higher latency (700-1200ms typical)
- Loses some emotion/nuance
- More components to manage
"""
Production Pipeline Example
""" import { Deepgram } from '@deepgram/sdk'; import { ElevenLabsClient } from 'elevenlabs'; import OpenAI from 'openai';
// Initialize clients const deepgram = new Deepgram(process.env.DEEPGRAM_API_KEY); const elevenlabs = new ElevenLabsClient(); const openai = new OpenAI();
async function processVoiceInput(audioStream) { // 1. Speech-to-Text (Deepgram Nova-3) const transcription = await deepgram.transcription.live({ model: 'nova-3', punctuate: true, endpointing: 300, // ms of silence before end });
transcription.on('transcript', async (data) => { if (data.is_final && data.speech_final) { const userText = data.channel.alternatives[0].transcript; console.log('User:', userText); // 2. LLM Processing const completion = await openai.chat.completions.create({ model: 'gpt-4o-mini', messages: [ { role: 'system', content: 'You are a concise voice assistant.' }, { role: 'user', content: userText } ], max_tokens: 150, // Keep responses short for voice }); const agentText = completion.choices[0].message.content; console.log('Agent:', agentText); // 3. Text-to-Speech (ElevenLabs) const audioStream = await elevenlabs.textToSpeech.stream({ voice_id: 'voice_id_here', text: agentText, model_id: 'eleven_flash_v2_5', // Lowest latency }); // Stream to user playAudioStream(audioStream); } }); // Pipe audio to transcription audioStream.pipe(transcription);} """
Optimization Tips:
- Start TTS while LLM still generating (streaming) - see the sketch after this list
- Pre-compute first response segment during user speech
- Use Flash/turbo models for latency
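A minimal sketch of the first tip, assuming a hypothetical speakSentence() helper that wraps whatever TTS client you use: buffer the streamed LLM tokens and flush them sentence by sentence so synthesis starts before generation finishes.

"""
// Sketch: flush streamed LLM tokens to TTS at sentence boundaries so audio
// starts before generation finishes. speakSentence() is a placeholder for
// whatever TTS client you use.
import OpenAI from 'openai';

const openai = new OpenAI();

async function streamReplyToTTS(userText, speakSentence) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    stream: true,
    messages: [
      { role: 'system', content: 'You are a concise voice assistant.' },
      { role: 'user', content: userText },
    ],
  });

  let buffer = '';
  for await (const chunk of stream) {
    buffer += chunk.choices[0]?.delta?.content ?? '';
    // Speak each complete sentence as soon as it appears
    let match;
    while ((match = buffer.match(/^(.*?[.!?])\s+(.*)$/s))) {
      await speakSentence(match[1]);
      buffer = match[2];
    }
  }
  if (buffer.trim()) await speakSentence(buffer.trim());
}
"""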
- name: Voice Activity Detection Pattern
  description: Detect when user starts/stops speaking
  when: All voice agents need VAD for turn-taking
  example: |
VOICE ACTIVITY DETECTION (VAD):
""" VAD Types:
- Energy-based: Simple, fast, noise-sensitive
- Model-based: Silero VAD, more accurate
- Semantic VAD: Understands meaning, best for conversation
"""
Silero VAD (Popular Open Source)
""" import { SileroVAD } from '@pipecat-ai/silero-vad';
const vad = new SileroVAD({ threshold: 0.5, // Speech probability threshold min_speech_duration: 250, // ms before speech confirmed min_silence_duration: 500, // ms of silence = end of turn });
vad.on('speech_start', () => { console.log('User started speaking'); // Stop any playing TTS (barge-in) audioPlayer.stop(); });
vad.on('speech_end', () => { console.log('User finished speaking'); // Trigger response generation processTranscript(); });
// Feed audio to VAD audioStream.on('data', (chunk) => { vad.process(chunk); }); """
OpenAI Semantic VAD
""" // In Realtime API session config client.updateSession({ turn_detection: { type: 'semantic_vad', // Uses meaning, not just silence // Model waits longer after "ummm..." // Responds faster after "Yes, that's correct." }, }); """
Barge-In Handling
""" // When user interrupts: function handleBargeIn() { // 1. Stop TTS immediately audioPlayer.stop();
// 2. Cancel pending LLM generation llmController.abort(); // 3. Reset state conversationState.checkpoint(); // 4. Listen to new input startListening();}
// VAD triggers barge-in vad.on('speech_start', () => { if (audioPlayer.isPlaying) { handleBargeIn(); } }); """
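The checkpoint step above is deliberately abstract. One possible implementation (a sketch, where spokenCharCount() is a hypothetical audio-player method and conversationHistory/currentReplyText are placeholder names): truncate the assistant turn in conversation history to what was actually played, so the LLM's context matches what the user really heard.

"""
// Sketch: keep LLM history honest after a barge-in.
// spokenCharCount() is a hypothetical audio-player method reporting how many
// characters of the current reply were actually played back.
function checkpointOnInterrupt(history, fullReplyText, player) {
  const spoken = fullReplyText.slice(0, player.spokenCharCount());
  history.push({
    role: 'assistant',
    content: spoken
      ? `${spoken} [interrupted by user]`
      : '[interrupted before speaking]',
  });
}

// Plug into the barge-in handler above:
// vad.on('speech_start', () => {
//   if (audioPlayer.isPlaying) {
//     handleBargeIn();
//     checkpointOnInterrupt(conversationHistory, currentReplyText, audioPlayer);
//   }
// });
"""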
- name: Latency Optimization Pattern
  description: Achieving <800ms end-to-end response time
  when: Production voice agents
  example: |
LATENCY OPTIMIZATION:
""" Target Metrics:
- End-to-end: <800ms (ideal: <500ms)
- Time-to-First-Token (TTFT): <300ms
- Barge-in response: <200ms
- Jitter variance: <100ms std dev
"""
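A sketch of tracking these metrics in a live session, assuming you can timestamp the end of user speech and the first agent audio chunk: record per-turn latency, then report the mean and the standard deviation (jitter).

"""
// Sketch: track end-to-end turn latency and jitter across a session.
const turnLatencies = [];
let speechEndedAt = 0;

// Call when VAD reports the user stopped speaking
function markUserDone() {
  speechEndedAt = Date.now();
}

// Call when the first agent audio chunk is written to the speaker
function markFirstAudioOut() {
  if (!speechEndedAt) return;
  turnLatencies.push(Date.now() - speechEndedAt);
  speechEndedAt = 0;
}

// Mean and standard deviation (jitter) over the session so far
function latencyReport() {
  const n = turnLatencies.length;
  const mean = turnLatencies.reduce((a, b) => a + b, 0) / n;
  const variance = turnLatencies.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  return { turns: n, meanMs: Math.round(mean), jitterMs: Math.round(Math.sqrt(variance)) };
}
"""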
Pipeline Latency Breakdown
""" Typical breakdown:
- VAD processing: 50-100ms
- STT first result: 150-200ms
- LLM TTFT: 100-300ms
- TTS TTFA: 75-200ms
- Audio buffering: 50-100ms
Total: 425-900ms
"""
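To keep this breakdown honest as the pipeline grows, a budget check like the sketch below helps (the per-stage numbers are illustrative mid-range targets drawn from the ranges above, not measurements): declare an allowance per component and warn when the sum exceeds the end-to-end target.

"""
// Sketch: declare a per-stage latency allowance and warn when the sum
// exceeds the end-to-end target. Numbers are illustrative targets.
const LATENCY_BUDGET_MS = {
  vad: 75,
  stt_first_result: 175,
  llm_ttft: 250,
  tts_ttfa: 150,
  audio_buffering: 75,
};

const TARGET_MS = 800;
const total = Object.values(LATENCY_BUDGET_MS).reduce((a, b) => a + b, 0);

if (total > TARGET_MS) {
  console.warn(`Latency budget ${total}ms exceeds ${TARGET_MS}ms target`);
} else {
  console.log(`Latency budget ${total}ms leaves ${TARGET_MS - total}ms headroom`);
}
"""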
Optimization Strategies
1. Streaming Everything
""" // Stream STT results as they come stt.on('partial_transcript', (text) => { // Start processing before final transcript llmPreprocessor.prepare(text); });
// Stream LLM output to TTS const llmStream = await openai.chat.completions.create({ stream: true, // ... });
for await (const chunk of llmStream) { tts.appendText(chunk.choices[0].delta.content); } """
2. Pre-computation
""" // While user is speaking, predict and prepare stt.on('partial_transcript', async (text) => { // Pre-fetch relevant context const context = await retrieveContext(text);
// Pre-compute likely first sentence const firstSentence = await generateOpener(context);}); """
3. Use Low-Latency Models
""" // STT: Deepgram Nova-3 (150ms TTFT) // LLM: gpt-4o-mini (fastest GPT-4 class) // TTS: ElevenLabs Flash (75ms) or Deepgram Aura-2 (184ms) """
4. Edge Deployment
""" // Run inference closer to user // - Cloud regions near user // - Edge computing for VAD/STT // - WebSocket over HTTP for lower overhead """
- name: Conversation Design Pattern
  description: Designing natural voice conversations
  when: Building voice UX
  example: |
CONVERSATION DESIGN:
Voice-First Principles
""" Voice is different from text:
- No undo button - say it right the first time
- Linear - user can't scroll back
- Ephemeral - easy to miss information
- Emotional - tone matters as much as words
"""
Response Design
"""
Keep responses short (10-20 seconds max)
Front-load the answer
Use signposting for lists
Bad: "I found several options. The first is... second is..." Good: "I found 3 options. Want me to go through them?"
Confirm understanding
Bad: "I'll transfer $500 to John." Good: "So that's $500 to John Smith. Should I proceed?" """
Prompting for Voice
""" system_prompt = ''' You are a voice assistant. Follow these rules:
- Be concise - keep responses under 30 words
- Use natural speech - contractions, casual language
- Never use formatting (bullets, numbers in lists)
- Spell out numbers and abbreviations
- End with a question to keep conversation flowing
- If unclear, ask for clarification
- Never say "I'm an AI" unless asked
Good: "Got it. I'll set that reminder for three pm. Anything else?" Bad: "I have set a reminder for 3:00 PM. Is there anything else I can assist you with today?" ''' """
Error Recovery
""" // Handle recognition errors gracefully const errorResponses = { no_speech: "I didn't catch that. Could you say it again?", unclear: "Sorry, I'm not sure I understood. You said [repeat]. Is that right?", timeout: "Still there? I'm here when you're ready.", };
// Always offer human fallback for complex issues if (confidenceScore < 0.6) { response = "I want to make sure I get this right. Would you like to speak with a human agent?"; } """
anti_patterns:
- name: Ignoring Latency Budget
  description: Adding components without considering latency impact
  why: |
    Every millisecond matters in voice. Adding a "quick" API call or safety check can push you over the 800ms threshold where conversations feel awkward.
  instead: |
    Budget latency for each component. Measure everything. If you must add a step, optimize or remove something else.
- name: Silence-Only Turn Detection
  description: Waiting for X seconds of silence to detect turn end
  why: |
    People pause mid-sentence. A 1-second pause might be thinking. Silence-only detection either interrupts or waits too long.
  instead: |
    Use semantic VAD that understands content. "Yes." should trigger faster than "Well, let me think about that..."
- name: Long Responses
  description: Generating paragraphs of text for voice
  why: |
    Users can't scroll back. Long responses lose people. By the time you finish, they've forgotten the beginning.
  instead: |
    Keep responses under 30 words. For complex info, chunk into digestible pieces with confirmation between each.
- name: Text-Like Formatting
  description: Using bullets, numbers, markdown in voice
  why: |
    "First bullet point: item one" sounds robotic. You can't "see" formatting in speech.
  instead: |
    Use natural speech patterns. "There are three things. First..." Signal structure verbally, not visually.
- name: No Interruption Handling
  description: Agent talks through user interruptions
  why: |
    In human conversation, we stop when interrupted. An agent that keeps talking is infuriating and makes users repeat themselves.
  instead: |
    Implement barge-in detection. Stop TTS immediately when user starts speaking. Resume from last checkpoint or restart.
handoffs:
  receives_from:
    - skill: llm-architect
      receives: LLM selection and prompting strategies
    - skill: agent-tool-builder
      receives: Tools voice agents can call
    - skill: product-strategy
      receives: Voice UX requirements
  hands_to:
    - skill: backend
      provides: Phone/telephony integration needs
    - skill: multi-agent-orchestration
      provides: Voice in multi-agent systems
    - skill: agent-evaluation
      provides: Voice agent testing requirements
tags:
- voice
- speech
- tts
- stt
- whisper
- elevenlabs
- deepgram
- realtime
- conversational-ai
- vad
- barge-in