Skills azure-ai-voicelive-py
Build real-time voice AI applications using Azure AI Voice Live SDK (azure-ai-voicelive). Use this skill when creating Python applications that need real-time bidirectional audio communication with Azure AI, including voice assistants, voice-enabled chatbots, real-time speech-to-speech translation, voice-driven avatars, or any WebSocket-based audio streaming with AI models. Supports Server VAD (Voice Activity Detection), turn-based conversation, function calling, MCP tools, avatar integration, and transcription.
install
source · Clone the upstream repo
git clone https://github.com/microsoft/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/microsoft/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.github/plugins/azure-sdk-python/skills/azure-ai-voicelive-py" ~/.claude/skills/microsoft-skills-azure-ai-voicelive-py && rm -rf "$T"
manifest: .github/plugins/azure-sdk-python/skills/azure-ai-voicelive-py/SKILL.md
Azure AI Voice Live SDK
Build real-time voice AI applications with bidirectional WebSocket communication.
Installation
pip install azure-ai-voicelive aiohttp azure-identity
Environment Variables
    AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com
    # For API key auth (not recommended for production)
    AZURE_COGNITIVE_SERVICES_KEY=<api-key>
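A missing endpoint is easier to diagnose at startup than as a failed connection, so it can help to read both variables in one place. The following is a minimal sketch; `VoiceLiveSettings` and `load_settings` are hypothetical helper names and are not part of the SDK.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VoiceLiveSettings:
    """Hypothetical container for the two variables listed above."""
    endpoint: str
    api_key: Optional[str]  # None means: fall back to DefaultAzureCredential


def load_settings(env: Mapping[str, str] = os.environ) -> VoiceLiveSettings:
    """Read the environment and fail early if the endpoint is missing."""
    endpoint = env.get("AZURE_COGNITIVE_SERVICES_ENDPOINT", "").strip()
    if not endpoint:
        raise RuntimeError("AZURE_COGNITIVE_SERVICES_ENDPOINT is not set")
    if not endpoint.startswith("https://"):
        raise RuntimeError("AZURE_COGNITIVE_SERVICES_ENDPOINT should be an https:// URL")
    # An empty key is treated the same as an absent key.
    api_key = env.get("AZURE_COGNITIVE_SERVICES_KEY") or None
    return VoiceLiveSettings(endpoint=endpoint, api_key=api_key)
```

The returned `api_key` can then decide which of the two authentication styles shown in the next section to use.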
Authentication
DefaultAzureCredential (preferred):
    import os

    from azure.ai.voicelive.aio import connect
    from azure.identity.aio import DefaultAzureCredential

    # Run inside an async function
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=DefaultAzureCredential(),
        model="gpt-4o-realtime-preview",
        credential_scopes=["https://cognitiveservices.azure.com/.default"]
    ) as conn:
        ...
API Key:
    import os

    from azure.ai.voicelive.aio import connect
    from azure.core.credentials import AzureKeyCredential

    # Run inside an async function
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
        model="gpt-4o-realtime-preview"
    ) as conn:
        ...
Quick Start
    import asyncio
    import os

    from azure.ai.voicelive.aio import connect
    from azure.identity.aio import DefaultAzureCredential

    async def main():
        async with connect(
            endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
            credential=DefaultAzureCredential(),
            model="gpt-4o-realtime-preview",
            credential_scopes=["https://cognitiveservices.azure.com/.default"]
        ) as conn:
            # Update session with instructions
            await conn.session.update(session={
                "instructions": "You are a helpful assistant.",
                "modalities": ["text", "audio"],
                "voice": "alloy"
            })

            # Listen for events
            async for event in conn:
                print(f"Event: {event.type}")
                if event.type == "response.audio_transcript.done":
                    print(f"Transcript: {event.transcript}")
                elif event.type == "response.done":
                    break

    asyncio.run(main())
Core Architecture
Connection Resources
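The VoiceLiveConnection exposes these resources:
| Resource | Purpose | Key Methods |
|---|---|---|
| `conn.session` | Session configuration | `update()` |
| `conn.response` | Model responses | `create()`, `cancel()` |
| `conn.input_audio_buffer` | Audio input | `append()`, `commit()` |
| `conn.output_audio_buffer` | Audio output | `clear()` |
| `conn.conversation` | Conversation state | `item.create()` |
| | Transcription config | |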
Session Configuration
    from azure.ai.voicelive.models import RequestSession, FunctionTool

    await conn.session.update(session=RequestSession(
        instructions="You are a helpful voice assistant.",
        modalities=["text", "audio"],
        voice="alloy",  # or "echo", "shimmer", "sage", etc.
        input_audio_format="pcm16",
        output_audio_format="pcm16",
        turn_detection={
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500
        },
        tools=[
            FunctionTool(
                type="function",
                name="get_weather",
                description="Get current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            )
        ]
    ))
Audio Streaming
Send Audio (Base64 PCM16)
    import base64

    # Read audio chunk (16-bit PCM, 24kHz mono).
    # read_audio_from_microphone() is a placeholder for your own capture code.
    audio_chunk = await read_audio_from_microphone()
    b64_audio = base64.b64encode(audio_chunk).decode()
    await conn.input_audio_buffer.append(audio=b64_audio)
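When the audio comes from a file or a pre-recorded buffer rather than a live microphone, it needs to be cut into chunks before each one is base64-encoded and appended. The sketch below assumes the 16-bit, 24kHz mono format mentioned in the comment above. The 100 ms default chunk length and the helper names `chunk_size_bytes` and `iter_base64_chunks` are illustrative choices, not SDK requirements.

```python
import base64
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # 24kHz mono, as in the snippet above
BYTES_PER_SAMPLE = 2     # 16-bit PCM


def chunk_size_bytes(chunk_ms: int) -> int:
    """Number of PCM16 bytes that cover chunk_ms milliseconds of mono audio."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


def iter_base64_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield base64 strings, one per chunk; the last chunk may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield base64.b64encode(pcm[start:start + size]).decode("ascii")
```

Each yielded string could be passed as the `audio` argument of `conn.input_audio_buffer.append(...)` in place of `b64_audio`.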
Receive Audio
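    import base64

    async for event in conn:
        if event.type == "response.audio.delta":
            # event.delta is base64-encoded audio
            audio_bytes = base64.b64decode(event.delta)
            # play_audio() is a placeholder for your own playback code
            await play_audio(audio_bytes)
        elif event.type == "response.audio.done":
            print("Audio complete")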
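For debugging it is often handy to save a response to disk instead of playing it. The following sketch collects base64 deltas and wraps them in a WAV container using only the standard library. It assumes the output is 16-bit mono PCM at 24kHz, so adjust `sample_rate_hz` if the session uses another format. `deltas_to_wav` is a hypothetical helper name.

```python
import base64
import io
import wave
from typing import Iterable


def deltas_to_wav(deltas: Iterable[str], sample_rate_hz: int = 24_000) -> bytes:
    """Decode base64 PCM16 deltas and return them as a complete WAV file."""
    pcm = b"".join(base64.b64decode(delta) for delta in deltas)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono (assumed)
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

In the loop above, each `event.delta` would be appended to a list, and `deltas_to_wav(collected)` written to a file once `response.audio.done` arrives.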
Event Handling
    import base64
    import json

    async for event in conn:
        match event.type:
            # Session events
            case "session.created":
                print(f"Session: {event.session}")
            case "session.updated":
                print("Session updated")

            # Audio input events
            case "input_audio_buffer.speech_started":
                print(f"Speech started at {event.audio_start_ms}ms")
            case "input_audio_buffer.speech_stopped":
                print(f"Speech stopped at {event.audio_end_ms}ms")

            # Transcription events
            case "conversation.item.input_audio_transcription.completed":
                print(f"User said: {event.transcript}")
            case "conversation.item.input_audio_transcription.delta":
                print(f"Partial: {event.delta}")

            # Response events
            case "response.created":
                print(f"Response started: {event.response.id}")
            case "response.audio_transcript.delta":
                print(event.delta, end="", flush=True)
            case "response.audio.delta":
                audio = base64.b64decode(event.delta)
            case "response.done":
                print(f"Response complete: {event.response.status}")

            # Function calls (handle_function is your own dispatcher)
            case "response.function_call_arguments.done":
                result = handle_function(event.name, event.arguments)
                await conn.conversation.item.create(item={
                    "type": "function_call_output",
                    "call_id": event.call_id,
                    "output": json.dumps(result)
                })
                await conn.response.create()

            # Errors
            case "error":
                print(f"Error: {event.error.message}")
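The loop above calls `handle_function(event.name, event.arguments)` without defining it. One possible shape is a small registry that maps tool names to local Python functions and turns the JSON argument string into keyword arguments. This is only a sketch: the `tool` decorator, `TOOL_REGISTRY`, and the placeholder `get_weather` body are all hypothetical, and the error-dict convention is an assumption rather than something the service requires.

```python
import json
from typing import Any, Callable, Dict

TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a local function under its own name."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn


@tool
def get_weather(location: str) -> dict:
    # Placeholder result; a real tool would call a weather service here.
    return {"location": location, "forecast": "unknown"}


def handle_function(name: str, arguments: str) -> Any:
    """Look up the tool, parse its JSON arguments, and call it."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return {"error": f"unknown function: {name}"}
    try:
        kwargs = json.loads(arguments or "{}")
    except json.JSONDecodeError as exc:
        return {"error": f"invalid arguments: {exc.msg}"}
    if not isinstance(kwargs, dict):
        return {"error": "arguments must be a JSON object"}
    try:
        return fn(**kwargs)
    except TypeError as exc:  # missing or unexpected argument names
        return {"error": str(exc)}
```

Returning an error dictionary instead of raising keeps the event loop alive and lets the model see what went wrong through the `function_call_output` item.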
Common Patterns
Manual Turn Mode (No VAD)
    await conn.session.update(session={"turn_detection": None})

    # Manually control turns
    await conn.input_audio_buffer.append(audio=b64_audio)
    await conn.input_audio_buffer.commit()  # End of user turn
    await conn.response.create()  # Trigger response
Interrupt Handling
    async for event in conn:
        if event.type == "input_audio_buffer.speech_started":
            # User interrupted - cancel current response
            await conn.response.cancel()
            await conn.output_audio_buffer.clear()
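Cancelling the response does not stop audio that the client has already received and queued for playback. If the application buffers decoded chunks locally, that buffer likely needs to be flushed at the same moment. The sketch below shows one way to do this; `PlaybackBuffer` is a hypothetical client-side helper, unrelated to `conn.output_audio_buffer`.

```python
from collections import deque
from typing import Deque, Optional


class PlaybackBuffer:
    """Client-side FIFO of decoded audio chunks waiting to be played."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()

    def push(self, audio_bytes: bytes) -> None:
        """Queue a chunk decoded from a response.audio.delta event."""
        self._chunks.append(audio_bytes)

    def pop(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if nothing is queued."""
        return self._chunks.popleft() if self._chunks else None

    def flush(self) -> int:
        """Drop everything still queued and report how many chunks were dropped."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped
```

Calling `flush()` in the same branch that cancels the response keeps the assistant from talking over the user with stale audio.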
Conversation History
    # Add system message
    await conn.conversation.item.create(item={
        "type": "message",
        "role": "system",
        "content": [{"type": "input_text", "text": "Be concise."}]
    })

    # Add user message
    await conn.conversation.item.create(item={
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "Hello!"}]
    })

    await conn.response.create()
Voice Options
| Voice | Description |
|---|---|
| | Neutral, balanced |
| | Warm, conversational |
| | Clear, professional |
| | Calm, authoritative |
| | Friendly, upbeat |
| | Deep, measured |
| | Expressive |
| | Storytelling |
Azure voices: Use the AzureStandardVoice, AzureCustomVoice, or AzurePersonalVoice models.
Audio Formats
| Format | Sample Rate | Use Case |
|---|---|---|
| `pcm16` | 24kHz | Default, high quality |
| | 8kHz | Telephony |
| | 16kHz | Voice assistants |
| | 8kHz | Telephony (US) |
| | 8kHz | Telephony (EU) |
Turn Detection Options
    # Server VAD (default)
    {"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}

    # Azure Semantic VAD (smarter detection)
    {"type": "azure_semantic_vad"}
    {"type": "azure_semantic_vad_en"}  # English optimized
    {"type": "azure_semantic_vad_multilingual"}
Error Handling
    from azure.ai.voicelive.aio import connect, ConnectionError, ConnectionClosed

    try:
        # Pass endpoint, credential, and model as shown under Authentication
        async with connect(...) as conn:
            async for event in conn:
                if event.type == "error":
                    print(f"API Error: {event.error.code} - {event.error.message}")
    except ConnectionClosed as e:
        print(f"Connection closed: {e.code} - {e.reason}")
    except ConnectionError as e:
        print(f"Connection error: {e}")
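Long-running voice sessions may need to reconnect after a dropped connection. The sketch below is a generic retry loop with capped exponential backoff, written against the standard library only. Whether retrying is appropriate for a given close code is an assumption the application has to check. `backoff_delays` and `run_with_retry` are hypothetical names, and `run_session` stands for your own coroutine that opens the connection and runs the event loop.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 8.0, attempts: int = 5) -> List[float]:
    """Delays in seconds between retries: base, base*factor, ... capped at cap."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


async def run_with_retry(
    run_session: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    delays: Sequence[float] = (0.5, 1.0, 2.0),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run run_session once, then once more after each delay while it keeps failing."""
    last_error = None
    for delay in [0.0, *delays]:
        if delay > 0:
            await sleep(delay)
        try:
            return await run_session()
        except retry_on as exc:
            last_error = exc
    assert last_error is not None
    raise last_error
```

In an application, `retry_on` would presumably be set to the `ConnectionError` and `ConnectionClosed` classes imported above, for example `retry_on=(ConnectionError, ConnectionClosed)`.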
References
- Detailed API Reference: See references/api-reference.md
- Complete Examples: See references/examples.md
- All Models & Types: See references/models.md