Skills · vibe-voice
install
source · Clone the upstream repo
```shell
git clone https://github.com/TerminalSkills/skills
```
Claude Code · Install into ~/.claude/skills/
```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/vibe-voice" ~/.claude/skills/terminalskills-skills-vibe-voice && rm -rf "$T"
```
manifest:
skills/vibe-voice/SKILL.md
safety · automated scan (low risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- pip install
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content
VibeVoice
Open-source frontier voice AI from Microsoft. Family of models for text-to-speech (TTS), automatic speech recognition (ASR), and real-time streaming synthesis.
GitHub: microsoft/VibeVoice
Overview
VibeVoice is Microsoft's open-source voice AI platform offering three model sizes: ASR (7B) for transcription with speaker diarization, TTS (1.5B) for multi-speaker long-form synthesis, and Realtime (0.5B) for low-latency streaming. It supports 50+ languages and runs on consumer GPUs.
Instructions
Installation
```shell
pip install vibevoice

# Or clone for development
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -e .
```
Hardware Requirements
- ASR (7B): ~16GB VRAM (GPU recommended)
- TTS (1.5B): ~6GB VRAM
- Realtime (0.5B): ~2GB VRAM (runs on consumer GPUs)
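The table above can double as a quick pre-download sanity check. A minimal sketch, assuming the approximate VRAM figures listed (the model labels here are illustrative shorthand, not official checkpoint names):

```python
# Rough VRAM requirements from the table above, in GB (approximate figures).
VRAM_REQUIRED_GB = {
    "VibeVoice-ASR-7B": 16,
    "VibeVoice-TTS-1.5B": 6,
    "VibeVoice-Realtime-0.5B": 2,
}

def models_that_fit(available_vram_gb: float) -> list[str]:
    """Return the models whose approximate VRAM need fits the budget."""
    return sorted(
        name for name, need in VRAM_REQUIRED_GB.items()
        if need <= available_vram_gb
    )

# An 8GB consumer GPU fits the TTS and Realtime models but not the 7B ASR.
print(models_that_fit(8))
```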
Text-to-Speech (TTS)
```python
from vibevoice import VibeVoiceTTS

model = VibeVoiceTTS.from_pretrained("microsoft/VibeVoice-1.5B")

# Single speaker
audio = model.synthesize(
    text="Hello, welcome to the future of voice AI.",
    speaker="default",
)
audio.save("output.wav")
```
Multi-Speaker Conversation
Generate podcast-style audio with up to 4 distinct speakers:
```python
conversation = [
    {"speaker": "host", "text": "Welcome to the show! Today we're discussing AI."},
    {"speaker": "guest1", "text": "Thanks for having me. I'm excited to dive in."},
    {"speaker": "host", "text": "Let's start with the biggest trends you're seeing."},
    {"speaker": "guest2", "text": "I think voice AI is the most underrated development."},
]

audio = model.synthesize_conversation(conversation)
audio.save("podcast.wav")  # Up to 90 minutes in a single pass
```
Real-Time Streaming TTS
```python
from vibevoice import VibeVoiceRealtime

model = VibeVoiceRealtime.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")

for audio_chunk in model.stream("This is being generated in real time."):
    play_audio(audio_chunk)
```
Automatic Speech Recognition (ASR)
```python
from vibevoice import VibeVoiceASR

model = VibeVoiceASR.from_pretrained("microsoft/VibeVoice-ASR")

# Basic transcription
result = model.transcribe("meeting_recording.wav")
print(result.text)

# Rich transcription with diarization
result = model.transcribe("meeting.wav", diarize=True, timestamps=True)
for segment in result.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] "
          f"Speaker {segment.speaker}: {segment.text}")
```
Custom Hotwords
```python
result = model.transcribe(
    "medical_consultation.wav",
    hotwords=["Lisinopril", "Metformin", "HbA1c", "systolic"],
)
```
Examples
Example 1: Build a Voice Assistant
```python
from vibevoice import VibeVoiceASR, VibeVoiceRealtime

asr = VibeVoiceASR.from_pretrained("microsoft/VibeVoice-ASR")
tts = VibeVoiceRealtime.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")

# Listen → Transcribe → Respond → Speak
user_text = asr.transcribe("user_input.wav").text
response = generate_response(user_text)  # Your LLM call
for chunk in tts.stream(response):
    play_audio(chunk)
```
Example 2: Meeting Transcription with Speaker Labels
```python
result = asr.transcribe("team_standup.wav", diarize=True, timestamps=True)

for segment in result.segments:
    print(f"[{segment.start:.1f}s] Speaker {segment.speaker}: {segment.text}")

# Output:
# [0.0s] Speaker 1: Let's review the Q3 numbers.
# [3.5s] Speaker 2: Revenue is up 15% from last quarter.
# [8.4s] Speaker 1: That's great. What about customer acquisition?
```
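For post-processing, the per-segment fields used in this example can be rendered into a plain transcript string. A small sketch that assumes the `start`/`speaker`/`text` attribute names shown above (the real result object's shape may differ):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    # Mirrors the fields used in the example above (assumed shape).
    start: float
    end: float
    speaker: int
    text: str

def to_transcript(segments: list[Segment]) -> str:
    """Render diarized segments as '[t] Speaker n: text' lines."""
    return "\n".join(
        f"[{s.start:.1f}s] Speaker {s.speaker}: {s.text}" for s in segments
    )

segments = [
    Segment(0.0, 3.2, 1, "Let's review the Q3 numbers."),
    Segment(3.5, 8.1, 2, "Revenue is up 15% from last quarter."),
]
print(to_transcript(segments))
```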
Guidelines
- Use the Realtime (0.5B) model for voice assistants where latency matters
- The ASR model handles up to 60 minutes of audio in a single pass
- TTS supports up to 90 minutes of multi-speaker audio generation
- Use custom hotwords for domain-specific terms (medical, legal, technical)
- For production, consider vLLM for faster ASR inference
- Multilingual voices are experimental — test quality before deploying
- 7.5 Hz frame rate enables efficient long-sequence processing
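The last point is simple arithmetic: at 7.5 frames per second, even the longest supported clips stay compact in frame count. A quick sketch:

```python
FRAME_RATE_HZ = 7.5  # acoustic frames per second, per the guideline above

def frames_for(duration_s: float) -> int:
    """Approximate number of frames for a clip of the given length."""
    return round(duration_s * FRAME_RATE_HZ)

# A 90-minute generation (the stated multi-speaker TTS ceiling):
print(frames_for(90 * 60))  # 40500 frames
```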