Trending-skills omnivoice-tts
Expert skill for OmniVoice, a massively multilingual zero-shot TTS model supporting 600+ languages with voice cloning and voice design capabilities.
install
source · Clone the upstream repo
```bash
git clone https://github.com/Aradotso/trending-skills
```
Claude Code · Install into ~/.claude/skills/
```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/omnivoice-tts" ~/.claude/skills/aradotso-trending-skills-omnivoice-tts && rm -rf "$T"
```
manifest:
skills/omnivoice-tts/SKILL.md
OmniVoice TTS Skill
Skill by ara.so — Daily 2026 Skills collection.
OmniVoice is a state-of-the-art zero-shot TTS model supporting 600+ languages, built on a diffusion language-model-style architecture. It supports voice cloning (from reference audio), voice design (via text attributes), and auto voice generation, with a real-time factor (RTF, generation time divided by audio duration) as low as 0.025; at that rate, a 10-second clip synthesizes in about 0.25 seconds.
Installation
Requirements
- Python 3.9+
- PyTorch 2.8+
- CUDA (recommended) or Apple Silicon (MPS) or CPU
pip (recommended)
```bash
# Step 1: Install PyTorch for your platform

# NVIDIA GPU (CUDA 12.8)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

# Apple Silicon
pip install torch==2.8.0 torchaudio==2.8.0

# Step 2: Install OmniVoice
pip install omnivoice

# Or from source (latest)
pip install git+https://github.com/k2-fsa/OmniVoice.git

# Or editable dev install
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
```
uv
```bash
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync

# With mirror:
uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"
```
HuggingFace Mirror (if blocked)
```bash
export HF_ENDPOINT="https://hf-mirror.com"
```
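The same redirect can be applied from Python. A minimal sketch, assuming OmniVoice downloads weights via `huggingface_hub` (which reads `HF_ENDPOINT` when first imported, so set it before importing `omnivoice`):

```python
import os

# Must be set before huggingface_hub is imported anywhere in the process,
# hence before importing omnivoice.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from omnivoice import OmniVoice  # import after env setup
```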
Core Concepts
| Mode | What you provide | Use case |
|---|---|---|
| Voice Cloning | `ref_audio` + `ref_text` (optional) | Clone a speaker from a short audio clip |
| Voice Design | `instruct` string | Describe speaker attributes (no audio needed) |
| Auto Voice | nothing extra | Model picks a random voice |
Python API
Load the Model
```python
from omnivoice import OmniVoice
import torch
import torchaudio

# NVIDIA GPU
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
)

# Apple Silicon
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16
)

# CPU (slower)
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="cpu", dtype=torch.float32
)
```
Voice Cloning
```python
# With manual reference transcription (faster, more accurate)
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Without ref_text — Whisper auto-transcribes ref_audio
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
)

# audio is a list of torch.Tensor, shape (1, T) at 24kHz
torchaudio.save("out.wav", audio[0], 24000)
```
Voice Design
```python
# Describe speaker via comma-separated attributes
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
torchaudio.save("out.wav", audio[0], 24000)
```
Supported attributes:
- Gender: `male`, `female`
- Age: `child`, `young`, `middle-aged`, `elderly`
- Pitch: `very low pitch`, `low pitch`, `high pitch`, `very high pitch`
- Style: `whisper`
- English accents: `american accent`, `british accent`, `australian accent`, etc.
- Chinese dialects: `四川话` (Sichuanese), `陕西话` (Shaanxi dialect), etc.
Auto Voice
```python
audio = model.generate(text="This is a sentence without any voice prompt.")
torchaudio.save("out.wav", audio[0], 24000)
```
Generation Parameters
```python
audio = model.generate(
    text="Hello world.",
    ref_audio="ref.wav",
    ref_text="Reference text.",
    num_step=32,    # diffusion steps; use 16 for faster (slightly lower quality)
    speed=1.2,      # speaking rate multiplier (>1 faster, <1 slower)
    duration=8.0,   # fix output duration in seconds (overrides speed)
)
```
Non-Verbal Symbols
```python
# Insert expressive non-verbal sounds inline
audio = model.generate(
    text="[laughter] You really got me. I didn't see that coming at all."
)
```
Supported tags:
[laughter], [sigh], [confirmation-en], [question-en], [question-ah],
[question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh],
[surprise-wa], [surprise-yo], [dissatisfaction-hnn]
Pronunciation Control
```python
# Chinese: pinyin with tone numbers (inline, uppercase)
audio = model.generate(
    text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。"
)

# English: CMU dict pronunciation in brackets (uppercase)
audio = model.generate(
    text="You could probably still make [IH1 T] look good."
)
```
CLI Tools
Web Demo
```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
omnivoice-demo --help   # all options
```
Single Inference
```bash
# Voice Cloning (ref_text optional; omit for Whisper auto-transcription)
omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --ref_audio ref.wav \
  --ref_text "Transcription of the reference audio." \
  --output hello.wav

# Voice Design
omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --instruct "male, British accent" \
  --output hello.wav

# Auto Voice
omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --output hello.wav
```
Batch Inference (Multi-GPU)
```bash
omnivoice-infer-batch \
  --model k2-fsa/OmniVoice \
  --test_list test.jsonl \
  --res_dir results/
```
JSONL format (`test.jsonl`):
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"} {"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"} {"id": "sample_003", "text": "Auto voice example"} {"id": "sample_004", "text": "Speed controlled", "ref_audio": "/path/to/ref.wav", "speed": 1.2} {"id": "sample_005", "text": "Duration fixed", "ref_audio": "/path/to/ref.wav", "duration": 10.0} {"id": "sample_006", "text": "With language hint", "ref_audio": "/path/to/ref.wav", "language_id": "en", "language_name": "English"}
JSONL field reference:
| Field | Required | Description |
|---|---|---|
| `id` | ✅ | Unique identifier |
| `text` | ✅ | Text to synthesize |
| `ref_audio` | ❌ | Path to reference audio (voice cloning) |
| `ref_text` | ❌ | Transcript of ref audio |
| `instruct` | ❌ | Speaker attributes (voice design) |
| `language_id` | ❌ | Language code, e.g. `en` |
| `language_name` | ❌ | Language name, e.g. `English` |
| `duration` | ❌ | Fixed output duration in seconds |
| `speed` | ❌ | Speaking rate multiplier (ignored if duration set) |
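For larger jobs it is usually easier to generate `test.jsonl` from code than by hand. A small stdlib-only sketch, with item dicts mirroring the field table above:

```python
import json

items = [
    {"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav"},
    {"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"},
    {"id": "sample_003", "text": "Auto voice example"},
]

# One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
with open("test.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```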
Common Patterns
Full Voice Cloning Pipeline
```python
from omnivoice import OmniVoice
import torch
import torchaudio
from pathlib import Path

def clone_voice(ref_audio_path: str, texts: list[str], output_dir: str):
    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
    )
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(texts):
        audio = model.generate(
            text=text,
            ref_audio=ref_audio_path,
            # ref_text omitted: Whisper auto-transcribes
            num_step=32,
            speed=1.0,
        )
        out_path = f"{output_dir}/output_{i:04d}.wav"
        torchaudio.save(out_path, audio[0], 24000)
        print(f"Saved: {out_path}")

clone_voice(
    ref_audio_path="speaker.wav",
    texts=["Hello world.", "Second sentence.", "Third sentence."],
    output_dir="outputs/",
)
```
Batch Processing from a List
```python
from omnivoice import OmniVoice
import torch
import torchaudio

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
)

items = [
    {"id": "s1", "text": "English sentence.", "instruct": "female, american accent"},
    {"id": "s2", "text": "Another sentence.", "ref_audio": "ref.wav"},
    {"id": "s3", "text": "Auto voice."},
]

for item in items:
    kwargs = {"text": item["text"]}
    if "ref_audio" in item:
        kwargs["ref_audio"] = item["ref_audio"]
    if "ref_text" in item:
        kwargs["ref_text"] = item["ref_text"]
    if "instruct" in item:
        kwargs["instruct"] = item["instruct"]
    audio = model.generate(**kwargs)
    torchaudio.save(f"{item['id']}.wav", audio[0], 24000)
```
Voice Design Combinations
```python
designs = [
    "male, elderly, low pitch",
    "female, child, high pitch",
    "male, whisper",
    "female, british accent, high pitch",
    "male, american accent, middle-aged",
]

for design in designs:
    audio = model.generate(
        text="The quick brown fox jumps over the lazy dog.",
        instruct=design,
    )
    safe_name = design.replace(", ", "_").replace(" ", "-")
    torchaudio.save(f"design_{safe_name}.wav", audio[0], 24000)
```
Fast Inference (Lower Diffusion Steps)
```python
# Default: num_step=32 (high quality)
# Fast: num_step=16 (slightly lower quality, ~2x faster)
audio = model.generate(
    text="Fast inference example.",
    ref_audio="ref.wav",
    num_step=16,
)
```
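To check the speed/quality trade-off on your own hardware, a rough timing sketch (assumes the `model` handle and `ref.wav` from the examples above; on CUDA, `torch.cuda.synchronize()` keeps the wall-clock numbers honest):

```python
import time
import torch

for steps in (16, 32):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    audio = model.generate(
        text="Fast inference example.",
        ref_audio="ref.wav",
        num_step=steps,
    )
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    seconds = audio[0].shape[-1] / 24000  # output duration at 24 kHz
    print(f"num_step={steps}: {elapsed:.2f}s wall time, RTF={elapsed / seconds:.3f}")
```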
Output Format
- Sample rate: 24,000 Hz
- Type: `list[torch.Tensor]`, each tensor shape `(1, T)`
- Save: use `torchaudio.save(path, audio[0], 24000)`
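Because outputs are plain waveform tensors, post-processing is ordinary `torch` work. A minimal sketch that stitches several generations into one file, assuming the `model` handle and `ref.wav` from the examples above:

```python
import torch
import torchaudio

clips = []
for sentence in ["First sentence.", "Second sentence."]:
    audio = model.generate(text=sentence, ref_audio="ref.wav")
    # Move to CPU float32 so tensors concatenate cleanly regardless of device/dtype.
    clips.append(audio[0].cpu().float())

# 300 ms of silence between clips, then concatenate along the time axis.
silence = torch.zeros(1, int(0.3 * 24000))
joined = torch.cat([part for clip in clips for part in (clip, silence)], dim=-1)
torchaudio.save("joined.wav", joined, 24000)
print(f"Total duration: {joined.shape[-1] / 24000:.2f}s")
```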
Troubleshooting
HuggingFace download fails
```bash
export HF_ENDPOINT="https://hf-mirror.com"
```
CUDA out of memory
```python
# Use float16 (not float32)
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16
)
# Or reduce batch size / text length in batch inference
```
Whisper ASR not available for ref_text auto-transcription
```bash
pip install openai-whisper
```
Wrong pronunciation in Chinese
Use inline pinyin with tone numbers directly in the text string:
```python
# Format: PINYIN + tone number (e.g. ZHE2), inline within the sentence
text = "这批货物打ZHE2出售"
```
Audio quality issues
- Increase `num_step` to 32 or 64
- Provide `ref_text` manually instead of relying on auto-transcription
- Use a clean, noise-free reference audio clip (3–15 seconds recommended); a preparation sketch follows
Apple Silicon (MPS) issues
```python
# Use mps device explicitly
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16
)
```
Model & Resources
| Resource | Link |
|---|---|
| HuggingFace Model | https://huggingface.co/k2-fsa/OmniVoice |
| HuggingFace Space | https://huggingface.co/spaces/k2-fsa/OmniVoice |
| Paper (arXiv) | https://arxiv.org/abs/2604.00688 |
| Demo Page | https://zhu-han.github.io/omnivoice |
| Supported Languages | in repo |
| Voice Design Attributes | in repo |
| Generation Parameters | in repo |
| Training/Eval Examples | in repo |