Trending-skills omnivoice-tts
Expert skill for OmniVoice, a massively multilingual zero-shot TTS model supporting 600+ languages with voice cloning and voice design capabilities.
install
source · Clone the upstream repo
```bash
git clone https://github.com/Aradotso/trending-skills
```
Claude Code · Install into ~/.claude/skills/
```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/omnivoice-tts" ~/.claude/skills/aradotso-trending-skills-omnivoice-tts && rm -rf "$T"
```
manifest:
skills/omnivoice-tts/SKILL.md
OmniVoice TTS Skill
Skill by ara.so — Daily 2026 Skills collection.
OmniVoice is a state-of-the-art zero-shot TTS model supporting 600+ languages, built on a diffusion language-model-style architecture. It supports voice cloning (from reference audio), voice design (via text attributes), and auto voice generation, with a real-time factor (RTF, generation time divided by audio duration) as low as 0.025; at that rate, a 10-second clip synthesizes in about 0.25 seconds.
Installation
Requirements
- Python 3.9+
- PyTorch 2.8+
- CUDA (recommended) or Apple Silicon (MPS) or CPU
pip (recommended)
```bash
# Step 1: Install PyTorch for your platform

# NVIDIA GPU (CUDA 12.8)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

# Apple Silicon
pip install torch==2.8.0 torchaudio==2.8.0

# Step 2: Install OmniVoice
pip install omnivoice

# Or from source (latest)
pip install git+https://github.com/k2-fsa/OmniVoice.git

# Or editable dev install
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
```
uv
```bash
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync

# With mirror:
uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"
```
HuggingFace Mirror (if blocked)
```bash
export HF_ENDPOINT="https://hf-mirror.com"
```
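The same redirect can be applied from Python. A minimal sketch, assuming OmniVoice downloads weights via `huggingface_hub` (which reads `HF_ENDPOINT` when first imported, so set it before importing `omnivoice`):

```python
import os

# Must be set before huggingface_hub is imported anywhere in the process,
# hence before importing omnivoice.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from omnivoice import OmniVoice  # import after env setup
```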
Core Concepts
| Mode | What you provide | Use case |
|---|---|---|
| Voice Cloning | `ref_audio` + `ref_text` (optional) | Clone a speaker from a short audio clip |
| Voice Design | `instruct` string | Describe speaker attributes (no audio needed) |
| Auto Voice | nothing extra | Model picks a random voice |
Python API
Load the Model
```python
from omnivoice import OmniVoice
import torch
import torchaudio

# NVIDIA GPU
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
)

# Apple Silicon
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16
)

# CPU (slower)
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="cpu", dtype=torch.float32
)
```
Voice Cloning
```python
# With manual reference transcription (faster, more accurate)
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Without ref_text — Whisper auto-transcribes ref_audio
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
)

# audio is a list of torch.Tensor, shape (1, T) at 24kHz
torchaudio.save("out.wav", audio[0], 24000)
```
Voice Design
```python
# Describe speaker via comma-separated attributes
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
torchaudio.save("out.wav", audio[0], 24000)
```
Supported attributes:
- Gender: `male`, `female`
- Age: `child`, `young`, `middle-aged`, `elderly`
- Pitch: `very low pitch`, `low pitch`, `high pitch`, `very high pitch`
- Style: `whisper`
- English accents: `american accent`, `british accent`, `australian accent`, etc.
- Chinese dialects: `四川话` (Sichuanese), `陕西话` (Shaanxi dialect), etc.
Auto Voice
```python
audio = model.generate(text="This is a sentence without any voice prompt.")
torchaudio.save("out.wav", audio[0], 24000)
```
Generation Parameters
```python
audio = model.generate(
    text="Hello world.",
    ref_audio="ref.wav",
    ref_text="Reference text.",
    num_step=32,    # diffusion steps; use 16 for faster (slightly lower quality)
    speed=1.2,      # speaking rate multiplier (>1 faster, <1 slower)
    duration=8.0,   # fix output duration in seconds (overrides speed)
)
```
Non-Verbal Symbols
```python
# Insert expressive non-verbal sounds inline
audio = model.generate(
    text="[laughter] You really got me. I didn't see that coming at all."
)
```
Supported tags:
[laughter], [sigh], [confirmation-en], [question-en], [question-ah],
[question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh],
[surprise-wa], [surprise-yo], [dissatisfaction-hnn]
Pronunciation Control
```python
# Chinese: pinyin with tone numbers (inline, uppercase)
audio = model.generate(
    text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。"
)

# English: CMU dict pronunciation in brackets (uppercase)
audio = model.generate(
    text="You could probably still make [IH1 T] look good."
)
```
CLI Tools
Web Demo
```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
omnivoice-demo --help   # all options
```
Single Inference
```bash
# Voice Cloning (ref_text optional; omit for Whisper auto-transcription)
omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --ref_audio ref.wav \
  --ref_text "Transcription of the reference audio." \
  --output hello.wav

# Voice Design
omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --instruct "male, British accent" \
  --output hello.wav

# Auto Voice
omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --output hello.wav
```
Batch Inference (Multi-GPU)
```bash
omnivoice-infer-batch \
  --model k2-fsa/OmniVoice \
  --test_list test.jsonl \
  --res_dir results/
```
JSONL format (`test.jsonl`):
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"} {"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"} {"id": "sample_003", "text": "Auto voice example"} {"id": "sample_004", "text": "Speed controlled", "ref_audio": "/path/to/ref.wav", "speed": 1.2} {"id": "sample_005", "text": "Duration fixed", "ref_audio": "/path/to/ref.wav", "duration": 10.0} {"id": "sample_006", "text": "With language hint", "ref_audio": "/path/to/ref.wav", "language_id": "en", "language_name": "English"}
JSONL field reference:
| Field | Required | Description |
|---|---|---|
| `id` | ✅ | Unique identifier |
| `text` | ✅ | Text to synthesize |
| `ref_audio` | ❌ | Path to reference audio (voice cloning) |
| `ref_text` | ❌ | Transcript of ref audio |
| `instruct` | ❌ | Speaker attributes (voice design) |
| `language_id` | ❌ | Language code, e.g. `en` |
| `language_name` | ❌ | Language name, e.g. `English` |
| `duration` | ❌ | Fixed output duration in seconds |
| `speed` | ❌ | Speaking rate multiplier (ignored if duration set) |
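For larger jobs it is usually easier to generate `test.jsonl` from code than by hand. A small stdlib-only sketch, with item dicts mirroring the field table above:

```python
import json

items = [
    {"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav"},
    {"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"},
    {"id": "sample_003", "text": "Auto voice example"},
]

# One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
with open("test.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```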
Common Patterns
Full Voice Cloning Pipeline
```python
from omnivoice import OmniVoice
import torch
import torchaudio
from pathlib import Path

def clone_voice(ref_audio_path: str, texts: list[str], output_dir: str):
    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
    )
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(texts):
        audio = model.generate(
            text=text,
            ref_audio=ref_audio_path,
            # ref_text omitted: Whisper auto-transcribes
            num_step=32,
            speed=1.0,
        )
        out_path = f"{output_dir}/output_{i:04d}.wav"
        torchaudio.save(out_path, audio[0], 24000)
        print(f"Saved: {out_path}")

clone_voice(
    ref_audio_path="speaker.wav",
    texts=["Hello world.", "Second sentence.", "Third sentence."],
    output_dir="outputs/",
)
```
Batch Processing from a List
```python
from omnivoice import OmniVoice
import torch
import torchaudio

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
)

items = [
    {"id": "s1", "text": "English sentence.", "instruct": "female, american accent"},
    {"id": "s2", "text": "Another sentence.", "ref_audio": "ref.wav"},
    {"id": "s3", "text": "Auto voice."},
]

for item in items:
    kwargs = {"text": item["text"]}
    if "ref_audio" in item:
        kwargs["ref_audio"] = item["ref_audio"]
    if "ref_text" in item:
        kwargs["ref_text"] = item["ref_text"]
    if "instruct" in item:
        kwargs["instruct"] = item["instruct"]
    audio = model.generate(**kwargs)
    torchaudio.save(f"{item['id']}.wav", audio[0], 24000)
```
Voice Design Combinations
```python
designs = [
    "male, elderly, low pitch",
    "female, child, high pitch",
    "male, whisper",
    "female, british accent, high pitch",
    "male, american accent, middle-aged",
]

for design in designs:
    audio = model.generate(
        text="The quick brown fox jumps over the lazy dog.",
        instruct=design,
    )
    safe_name = design.replace(", ", "_").replace(" ", "-")
    torchaudio.save(f"design_{safe_name}.wav", audio[0], 24000)
```
Fast Inference (Lower Diffusion Steps)
```python
# Default: num_step=32 (high quality)
# Fast: num_step=16 (slightly lower quality, ~2x faster)
audio = model.generate(
    text="Fast inference example.",
    ref_audio="ref.wav",
    num_step=16,
)
```
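To check the speed/quality trade-off on your own hardware, a rough timing sketch (assumes the `model` handle and `ref.wav` from the examples above; on CUDA, `torch.cuda.synchronize()` keeps the wall-clock numbers honest):

```python
import time
import torch

for steps in (16, 32):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    audio = model.generate(
        text="Fast inference example.",
        ref_audio="ref.wav",
        num_step=steps,
    )
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    seconds = audio[0].shape[-1] / 24000  # output duration at 24 kHz
    print(f"num_step={steps}: {elapsed:.2f}s wall time, RTF={elapsed / seconds:.3f}")
```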
Output Format
- Sample rate: 24,000 Hz
- Type: `list[torch.Tensor]`, each tensor shape `(1, T)`
- Save: use `torchaudio.save(path, audio[0], 24000)`
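Because outputs are plain waveform tensors, post-processing is ordinary `torch` work. A minimal sketch that stitches several generations into one file, assuming the `model` handle and `ref.wav` from the examples above:

```python
import torch
import torchaudio

clips = []
for sentence in ["First sentence.", "Second sentence."]:
    audio = model.generate(text=sentence, ref_audio="ref.wav")
    # Move to CPU float32 so tensors concatenate cleanly regardless of device/dtype.
    clips.append(audio[0].cpu().float())

# 300 ms of silence between clips, then concatenate along the time axis.
silence = torch.zeros(1, int(0.3 * 24000))
joined = torch.cat([part for clip in clips for part in (clip, silence)], dim=-1)
torchaudio.save("joined.wav", joined, 24000)
print(f"Total duration: {joined.shape[-1] / 24000:.2f}s")
```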
Troubleshooting
HuggingFace download fails
```bash
export HF_ENDPOINT="https://hf-mirror.com"
```
CUDA out of memory
```python
# Use float16 (not float32)
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
)
# Or reduce batch size / text length in batch inference
```
Whisper ASR not available for ref_text auto-transcription
```bash
pip install openai-whisper
```
Wrong pronunciation in Chinese
Use inline pinyin with tone numbers directly in the text string:
```python
# Format: PINYIN + tone number (e.g. ZHE2), inline within the sentence
text = "这批货物打ZHE2出售"
```
Audio quality issues
- Increase `num_step` to 32 or 64
- Provide `ref_text` manually instead of relying on auto-transcription
- Use a clean, noise-free reference audio clip (3–15 seconds recommended); a preparation sketch follows
Apple Silicon (MPS) issues
```python
# Use mps device explicitly
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16
)
```
Model & Resources
| Resource | Link |
|---|---|
| HuggingFace Model | https://huggingface.co/k2-fsa/OmniVoice |
| HuggingFace Space | https://huggingface.co/spaces/k2-fsa/OmniVoice |
| Paper (arXiv) | https://arxiv.org/abs/2604.00688 |
| Demo Page | https://zhu-han.github.io/omnivoice |
| Supported Languages | in repo |
| Voice Design Attributes | in repo |
| Generation Parameters | in repo |
| Training/Eval Examples | in repo |