# Skills: azure-ai-voicelive-dotnet

## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/microsoft/skills
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/microsoft/skills "$T" && \
  mkdir -p ~/.claude/skills && \
  cp -r "$T/.github/plugins/azure-sdk-dotnet/skills/azure-ai-voicelive-dotnet" \
    ~/.claude/skills/microsoft-skills-azure-ai-voicelive-dotnet && \
  rm -rf "$T"
```

**Manifest:** `.github/plugins/azure-sdk-dotnet/skills/azure-ai-voicelive-dotnet/SKILL.md`
# Azure.AI.VoiceLive (.NET)

Real-time voice AI SDK for building bidirectional voice assistants with Azure AI.
## Installation

```bash
dotnet add package Azure.AI.VoiceLive
dotnet add package Azure.Identity
dotnet add package NAudio   # For audio capture/playback
```

**Current versions:** Stable v1.0.0, Preview v1.1.0-beta.1
## Environment Variables

```bash
AZURE_VOICELIVE_ENDPOINT=https://<resource>.services.ai.azure.com/
AZURE_VOICELIVE_MODEL=gpt-4o-realtime-preview
AZURE_VOICELIVE_VOICE=en-US-AvaNeural
# Optional: API key if not using Entra ID
AZURE_VOICELIVE_API_KEY=<your-api-key>
```
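As a sketch of how these variables might be consumed at startup, the snippet below prefers Entra ID and falls back to an API key only when `AZURE_VOICELIVE_API_KEY` is set. The fallback logic itself is an illustration, not a prescribed SDK pattern; the variable names come from the table above and the two client constructors from the Authentication section.

```csharp
using Azure;
using Azure.Identity;
using Azure.AI.VoiceLive;

// Read the endpoint and optional API key from the environment.
string endpointUrl = Environment.GetEnvironmentVariable("AZURE_VOICELIVE_ENDPOINT")
    ?? throw new InvalidOperationException("AZURE_VOICELIVE_ENDPOINT is not set");
string? apiKey = Environment.GetEnvironmentVariable("AZURE_VOICELIVE_API_KEY");

var endpoint = new Uri(endpointUrl);

// Prefer Entra ID; use the key only when one is explicitly provided.
VoiceLiveClient client = string.IsNullOrEmpty(apiKey)
    ? new VoiceLiveClient(endpoint, new DefaultAzureCredential())
    : new VoiceLiveClient(endpoint, new AzureKeyCredential(apiKey));
```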
## Authentication

### Microsoft Entra ID (Recommended)

```csharp
using Azure.Identity;
using Azure.AI.VoiceLive;

Uri endpoint = new Uri("https://your-resource.cognitiveservices.azure.com");
DefaultAzureCredential credential = new DefaultAzureCredential();
VoiceLiveClient client = new VoiceLiveClient(endpoint, credential);
```

**Required role:** Cognitive Services User (assign in Azure Portal → Access control)
### API Key

```csharp
using Azure;
using Azure.AI.VoiceLive;

Uri endpoint = new Uri("https://your-resource.cognitiveservices.azure.com");
AzureKeyCredential credential = new AzureKeyCredential("your-api-key");
VoiceLiveClient client = new VoiceLiveClient(endpoint, credential);
```
## Client Hierarchy

```text
VoiceLiveClient
└── VoiceLiveSession (WebSocket connection)
    ├── ConfigureSessionAsync()
    ├── GetUpdatesAsync() → SessionUpdate events
    ├── AddItemAsync() → UserMessageItem, FunctionCallOutputItem
    ├── SendAudioAsync()
    └── StartResponseAsync()
```
## Core Workflow

### 1. Start Session and Configure

```csharp
using Azure.Identity;
using Azure.AI.VoiceLive;

var endpoint = new Uri(Environment.GetEnvironmentVariable("AZURE_VOICELIVE_ENDPOINT"));
var client = new VoiceLiveClient(endpoint, new DefaultAzureCredential());
var model = "gpt-4o-mini-realtime-preview";

// Start session
using VoiceLiveSession session = await client.StartSessionAsync(model);

// Configure session
VoiceLiveSessionOptions sessionOptions = new()
{
    Model = model,
    Instructions = "You are a helpful AI assistant. Respond naturally.",
    Voice = new AzureStandardVoice("en-US-AvaNeural"),
    TurnDetection = new AzureSemanticVadTurnDetection()
    {
        Threshold = 0.5f,
        PrefixPadding = TimeSpan.FromMilliseconds(300),
        SilenceDuration = TimeSpan.FromMilliseconds(500)
    },
    InputAudioFormat = InputAudioFormat.Pcm16,
    OutputAudioFormat = OutputAudioFormat.Pcm16
};

// Set modalities (both text and audio for voice assistants)
sessionOptions.Modalities.Clear();
sessionOptions.Modalities.Add(InteractionModality.Text);
sessionOptions.Modalities.Add(InteractionModality.Audio);

await session.ConfigureSessionAsync(sessionOptions);
```
### 2. Process Events

```csharp
await foreach (SessionUpdate serverEvent in session.GetUpdatesAsync())
{
    switch (serverEvent)
    {
        case SessionUpdateResponseAudioDelta audioDelta:
            byte[] audioData = audioDelta.Delta.ToArray();
            // Play audio via NAudio or other audio library
            break;

        case SessionUpdateResponseTextDelta textDelta:
            Console.Write(textDelta.Delta);
            break;

        case SessionUpdateResponseFunctionCallArgumentsDone functionCall:
            // Handle function call (see Function Calling section)
            break;

        case SessionUpdateError error:
            Console.WriteLine($"Error: {error.Error.Message}");
            break;

        case SessionUpdateResponseDone:
            Console.WriteLine("\n--- Response complete ---");
            break;
    }
}
```
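The audio-delta case above leaves playback to an external library. One way to wire it up with NAudio is to push each delta into a `BufferedWaveProvider`; this is a minimal sketch assuming the session is configured for 24 kHz mono PCM16 as recommended below.

```csharp
using NAudio.Wave;

// 24 kHz, 16-bit, mono — matching the Pcm16 session configuration (assumed).
var playbackFormat = new WaveFormat(24000, 16, 1);
var playbackBuffer = new BufferedWaveProvider(playbackFormat)
{
    BufferDuration = TimeSpan.FromSeconds(30),
    DiscardOnBufferOverflow = true
};
var waveOut = new WaveOutEvent();
waveOut.Init(playbackBuffer);
waveOut.Play();

// Then, inside the event loop, queue each audio delta for playback:
// case SessionUpdateResponseAudioDelta audioDelta:
//     byte[] audioData = audioDelta.Delta.ToArray();
//     playbackBuffer.AddSamples(audioData, 0, audioData.Length);
//     break;
```

`BufferedWaveProvider` decouples the network loop from the sound card: deltas arrive in bursts, and the buffer smooths them into continuous playback.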
### 3. Send User Message

```csharp
await session.AddItemAsync(new UserMessageItem("Hello, can you help me?"));
await session.StartResponseAsync();
```
### 4. Function Calling

```csharp
using System.Text.Json;

// Define function
var weatherFunction = new VoiceLiveFunctionDefinition("get_current_weather")
{
    Description = "Get the current weather for a given location",
    Parameters = BinaryData.FromString("""
    {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city and state or country"
            }
        },
        "required": ["location"]
    }
    """)
};

// Add to session options
sessionOptions.Tools.Add(weatherFunction);

// Handle function call in event loop
if (serverEvent is SessionUpdateResponseFunctionCallArgumentsDone functionCall)
{
    if (functionCall.Name == "get_current_weather")
    {
        var parameters = JsonSerializer.Deserialize<Dictionary<string, string>>(functionCall.Arguments);
        string location = parameters?["location"] ?? "";

        // Call external service
        string weatherInfo = $"The weather in {location} is sunny, 75°F.";

        // Send response
        await session.AddItemAsync(new FunctionCallOutputItem(functionCall.CallId, weatherInfo));
        await session.StartResponseAsync();
    }
}
```
## Voice Options

| Voice Type | Class | Example |
|---|---|---|
| Azure Standard | `AzureStandardVoice` | `new AzureStandardVoice("en-US-AvaNeural")` |
| Azure HD | | |
| Azure Custom | | Custom voice with endpoint ID |
## Supported Models

| Model | Description |
|---|---|
| `gpt-4o-realtime-preview` | GPT-4o with real-time audio |
| `gpt-4o-mini-realtime-preview` | Lightweight, fast interactions |
| | Cost-effective multimodal |
## Key Types Reference

| Type | Purpose |
|---|---|
| `VoiceLiveClient` | Main client for creating sessions |
| `VoiceLiveSession` | Active WebSocket session |
| `VoiceLiveSessionOptions` | Session configuration |
| `AzureStandardVoice` | Standard Azure voice provider |
| `AzureSemanticVadTurnDetection` | Voice activity detection |
| `VoiceLiveFunctionDefinition` | Function tool definition |
| `UserMessageItem` | User text message |
| `FunctionCallOutputItem` | Function call response |
| `SessionUpdateResponseAudioDelta` | Audio chunk event |
| `SessionUpdateResponseTextDelta` | Text chunk event |
## Best Practices

- **Always set both modalities** — Include `Text` and `Audio` for voice assistants
- **Use `AzureSemanticVadTurnDetection`** — Provides natural conversation flow
- **Configure appropriate silence duration** — 500ms typical to avoid premature cutoffs
- **Use `using` statement** — Ensures proper session disposal
- **Handle all event types** — Check for errors, audio, text, and function calls
- **Use `DefaultAzureCredential`** — Never hardcode API keys
## Error Handling

```csharp
if (serverEvent is SessionUpdateError error)
{
    if (error.Error.Message.Contains("Cancellation failed: no active response"))
    {
        // Benign error; safe to ignore
    }
    else
    {
        Console.WriteLine($"Error: {error.Error.Message}");
    }
}
```
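The `SessionUpdateError` check covers errors the service reports in-band; the WebSocket transport itself can also drop. Below is a hedged sketch of a reconnect wrapper around the session; the exception type caught and the retry policy are assumptions for illustration, not documented SDK behavior.

```csharp
// Hypothetical reconnect wrapper; retry policy and exception handling
// are assumptions, not SDK guidance.
const int maxAttempts = 3;
for (int attempt = 1; attempt <= maxAttempts; attempt++)
{
    try
    {
        using VoiceLiveSession session = await client.StartSessionAsync(model);
        await session.ConfigureSessionAsync(sessionOptions);

        await foreach (SessionUpdate serverEvent in session.GetUpdatesAsync())
        {
            // ... handle events as in "Process Events" above ...
        }
        break; // Clean shutdown: stop retrying.
    }
    catch (System.Net.WebSockets.WebSocketException ex) when (attempt < maxAttempts)
    {
        Console.WriteLine($"Connection lost ({ex.Message}); retry {attempt}/{maxAttempts}...");
        await Task.Delay(TimeSpan.FromSeconds(attempt)); // Simple linear backoff.
    }
}
```

Note that a reconnected session starts fresh: conversation state must be re-established (for example by replaying instructions via `ConfigureSessionAsync`).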
## Audio Configuration

- **Input format:** `InputAudioFormat.Pcm16` (16-bit PCM)
- **Output format:** `OutputAudioFormat.Pcm16`
- **Sample rate:** 24 kHz recommended
- **Channels:** Mono
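A capture-side sketch matching these settings, using NAudio's `WaveInEvent` to feed microphone audio into the session. `SendAudioAsync` is listed in the client hierarchy above; the exact overload taking a `byte[]` is assumed here, as is the 100 ms buffer size.

```csharp
using NAudio.Wave;

// Capture microphone audio in the recommended format: 24 kHz, 16-bit, mono.
var waveIn = new WaveInEvent
{
    WaveFormat = new WaveFormat(24000, 16, 1),
    BufferMilliseconds = 100 // ~100 ms chunks; an assumption, tune as needed
};

waveIn.DataAvailable += async (_, e) =>
{
    // e.Buffer may be larger than the valid data; copy only BytesRecorded.
    byte[] chunk = new byte[e.BytesRecorded];
    Array.Copy(e.Buffer, chunk, e.BytesRecorded);
    await session.SendAudioAsync(chunk);
};

waveIn.StartRecording();
```

With server-side turn detection (`AzureSemanticVadTurnDetection`) configured, audio can be streamed continuously like this and the service decides when the user has finished speaking.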
## Related SDKs

| SDK | Purpose | Install |
|---|---|---|
| `Azure.AI.VoiceLive` | Real-time voice (this SDK) | `dotnet add package Azure.AI.VoiceLive` |
| `Microsoft.CognitiveServices.Speech` | Speech-to-text, text-to-speech | `dotnet add package Microsoft.CognitiveServices.Speech` |
| `NAudio` | Audio capture/playback | `dotnet add package NAudio` |