Learn-skills.dev qwen3-tts-mlx

Local Qwen3-TTS speech synthesis on Apple Silicon via MLX. Use for offline narration, audiobooks, video voiceovers, and multilingual TTS.

install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agiseek/agent-skills/qwen3-tts-mlx" ~/.claude/skills/neversight-learn-skills-dev-qwen3-tts-mlx && rm -rf "$T"
manifest: data/skills-md/agiseek/agent-skills/qwen3-tts-mlx/SKILL.md
source content

Qwen3-TTS MLX

Run Qwen3-TTS locally on Apple Silicon (M1/M2/M3/M4) using MLX. Supports 11 languages, 9 built-in voices, voice cloning, and voice design from text descriptions.

When to Use

  • Generate speech fully offline on a Mac
  • Produce narration, audiobooks, podcasts, or video voiceovers
  • Create multilingual TTS with controllable style and emotion
  • Clone any voice from a short audio sample
  • Design custom voices from text descriptions

Quick Start

Install

pip install mlx-audio
brew install ffmpeg

Basic Usage

python scripts/run_tts.py custom-voice \
  --text "Hello, welcome to local text to speech." \
  --voice Ryan \
  --output output.wav

With Style Control

python scripts/run_tts.py custom-voice \
  --text "Breaking news: local AI model achieves human-level speech." \
  --voice Uncle_Fu \
  --instruct "news anchor tone, calm and authoritative" \
  --output news.wav

Model Variants

| Variant | Model | Size | Memory | Use Case |
|---|---|---|---|---|
| CustomVoice | mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit | ~1GB | ~4GB | Built-in voices + style control (recommended) |
| VoiceDesign | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit | ~2GB | ~5GB | Create voices from text descriptions |
| Base | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit | ~1GB | ~4GB | Voice cloning from reference audio |
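
The three variants map one-to-one onto the CLI modes. A small lookup table like the sketch below (illustrative glue code, not part of mlx-audio; the function name is hypothetical) keeps the model ids in one place:

```python
# Default model per CLI mode, taken from the variants table above.
DEFAULT_MODELS = {
    "custom-voice": "mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    "voice-design": "mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    "voice-clone": "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
}

def model_for(mode: str) -> str:
    """Return the default Hugging Face model id for a CLI mode."""
    try:
        return DEFAULT_MODELS[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
```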

Supported Languages

| Language | Code | Notes |
|---|---|---|
| Auto-detect | auto | Default, detects from text |
| Chinese | chinese | Mandarin |
| English | english | |
| Japanese | japanese | |
| Korean | korean | |
| French | french | |
| German | german | |
| Spanish | spanish | |
| Portuguese | portuguese | |
| Italian | italian | |
| Russian | russian | |

Built-in Voices

| Voice | Language | Character |
|---|---|---|
| Vivian | Chinese | Female, bright, young |
| Serena | Chinese | Female, gentle, soft |
| Uncle_Fu | Chinese | Male, authoritative, news anchor |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ryan | English | Male, energetic |
| Aiden | English | Male, clear, neutral |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |

Voice Selection Guide:

| Scenario | Recommended Voice |
|---|---|
| Chinese news/narration | Uncle_Fu |
| Chinese casual/lively | Eric |
| Chinese female, professional | Vivian |
| Chinese female, storytelling | Serena |
| English energetic content | Ryan |
| English neutral/educational | Aiden |
| Japanese content | Ono_Anna |
| Korean content | Sohee |
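
The guide above can be encoded as a simple lookup. This helper is a sketch (the function and the fallback behavior are not part of the package; Vivian is used as the fallback because it is the CLI default):

```python
# Scenario -> voice mapping from the selection guide above (illustrative only).
RECOMMENDED = {
    ("chinese", "news"): "Uncle_Fu",
    ("chinese", "casual"): "Eric",
    ("chinese", "professional"): "Vivian",
    ("chinese", "storytelling"): "Serena",
    ("english", "energetic"): "Ryan",
    ("english", "educational"): "Aiden",
    ("japanese", "any"): "Ono_Anna",
    ("korean", "any"): "Sohee",
}

def pick_voice(language: str, style: str = "any") -> str:
    """Pick a built-in voice; fall back to the language's 'any' entry,
    then to Vivian (the CLI default)."""
    lang = language.lower()
    return RECOMMENDED.get((lang, style)) or RECOMMENDED.get((lang, "any"), "Vivian")
```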

Modes

1) CustomVoice

Use built-in voices with optional emotion/style control via --instruct.

python scripts/run_tts.py custom-voice \
  --text "This is amazing news!" \
  --voice Vivian \
  --instruct "excited and happy" \
  --output excited.wav

Style instruction examples:

  • "calm and warm"
    - Soft, friendly delivery
  • "news anchor, authoritative"
    - Professional broadcast style
  • "excited and energetic"
    - High energy, enthusiastic
  • "sad and melancholic"
    - Emotional, somber tone
  • "whispering, intimate"
    - Quiet, close-mic feel
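
To render the same text in several of these styles, one option is to assemble one run_tts.py invocation per style and hand each to subprocess. The loop below only builds the argument lists (a sketch; the STYLES dict and output naming are assumptions, and actually running the commands requires the installed package and models):

```python
# Build one run_tts.py invocation per style instruction.
# Pass each list to subprocess.run(cmd, check=True) to synthesize.
STYLES = {
    "calm": "calm and warm",
    "news": "news anchor, authoritative",
    "excited": "excited and energetic",
}

def build_commands(text: str, voice: str = "Vivian") -> list:
    cmds = []
    for name, instruct in STYLES.items():
        cmds.append([
            "python", "scripts/run_tts.py", "custom-voice",
            "--text", text,
            "--voice", voice,
            "--instruct", instruct,
            "--output", f"{name}.wav",
        ])
    return cmds
```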

2) VoiceDesign

Create a completely new voice by describing it in natural language.

python scripts/run_tts.py voice-design \
  --text "Welcome to our podcast." \
  --instruct "warm, mature male narrator with low pitch and gentle tone" \
  --output podcast_intro.wav

Voice description examples:

  • "young cheerful female with high pitch"
  • "elderly wise male with deep resonant voice"
  • "professional female news anchor, clear articulation"
  • "friendly young male, casual and relaxed"
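
Descriptions like those above are free-form English, so they can also be composed from a few attributes. This tiny helper is purely illustrative (the template is an assumption; any natural-language description works as --instruct):

```python
def describe_voice(age: str, gender: str, pitch: str, tone: str) -> str:
    """Compose a VoiceDesign description string from attributes.
    Illustrative only: the template mirrors the examples above."""
    return f"{age} {gender} with {pitch} pitch, {tone} tone"
```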

3) VoiceClone

Clone any voice from a reference audio sample (5-10 seconds recommended).

python scripts/run_tts.py voice-clone \
  --text "This is my cloned voice speaking new content." \
  --ref_audio reference.wav \
  --ref_text "The exact transcript of the reference audio" \
  --output cloned.wav

Tips for voice cloning:

  • Use clean audio without background noise
  • 5-10 seconds of speech works best
  • Provide accurate transcript of the reference
  • Reference and output language should match
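
The cleanup tips can be scripted with ffmpeg (installed in Quick Start). The sketch below only builds the command; run it with subprocess.run(cmd, check=True). The -t/-ac/-ar flags are standard ffmpeg options; resampling to 24 kHz mirrors the model's output rate and is an assumption, not a documented requirement:

```python
# Build an ffmpeg command that trims a reference clip, downmixes to mono,
# and resamples to 24 kHz (the model's output sample rate).
def clean_reference_cmd(src: str, dst: str = "reference.wav", seconds: int = 8) -> list:
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-t", str(seconds),   # keep the first N seconds (5-10 s works best)
        "-ac", "1",           # mono
        "-ar", "24000",       # 24 kHz resample (assumed, matches output rate)
        dst,
    ]
```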

CLI Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| --text | Yes | - | Text to synthesize |
| --voice | No | Vivian | Built-in voice (CustomVoice only) |
| --lang_code | No | auto | Language code |
| --instruct | No | - | Style control or voice description |
| --speed | No | 1.0 | Speech speed multiplier |
| --temperature | No | 0.7 | Sampling temperature (higher = more variation) |
| --model | No | (per mode) | Override default model |
| --output | No | - | Output file path |
| --out-dir | No | ./outputs | Output directory when --output not set |
| --ref_audio | VoiceClone | - | Reference audio file |
| --ref_text | VoiceClone | - | Reference audio transcript |

Python API

Using generate_audio (recommended)

from mlx_audio.tts.generate import generate_audio

# CustomVoice with style control
generate_audio(
    text="Hello from Qwen3-TTS!",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    voice="Ryan",
    lang_code="english",
    instruct="friendly and warm",
    output_path=".",
    file_prefix="hello",
    audio_format="wav",
    join_audio=True,
    verbose=True,
)

Using Model directly

from mlx_audio.tts.utils import load
import soundfile as sf
import numpy as np

# Load model
model = load("mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit")

# Generate audio (returns a generator)
audio_chunks = []
for chunk in model.generate_custom_voice(
    text="Hello from Qwen3-TTS.",
    speaker="Ryan",
    language="english",
    instruct="clear, steady delivery"
):
    if hasattr(chunk, 'audio') and chunk.audio is not None:
        audio_chunks.append(chunk.audio)

# Combine and save
audio = np.concatenate(audio_chunks)
sf.write("output.wav", audio, 24000)

VoiceDesign

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Welcome to the show.",
    model="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    instruct="warm, friendly female narrator with medium pitch",
    lang_code="english",
    output_path=".",
    file_prefix="voice_design",
    join_audio=True,
)

VoiceClone

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="New content in the cloned voice.",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio",
    output_path=".",
    file_prefix="cloned",
    join_audio=True,
)

Batch Processing

Use scripts/batch_dubbing.py for processing multiple lines:

python scripts/batch_dubbing.py \
  --input dubbing.json \
  --out-dir outputs

See references/dubbing_format.md for the JSON format.

Performance

| Metric | Value |
|---|---|
| Sample rate | 24,000 Hz |
| Real-time factor | ~0.7x (faster than real-time) |
| Peak memory | ~4-6 GB |
| First run | Downloads model (~1-2GB) |
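
A real-time factor of ~0.7x means generation takes roughly 0.7 × the audio's duration, so a quick estimate of wall-clock time is:

```python
def estimated_generation_seconds(audio_seconds: float, rtf: float = 0.7) -> float:
    """Estimate generation time from the real-time factor:
    time = duration * RTF, so RTF < 1 is faster than real time."""
    return audio_seconds * rtf

# A 60-second clip at RTF 0.7 takes roughly 42 seconds to generate.
```

Actual throughput varies with the chip, model variant, and quantization, so treat the 0.7 figure as a ballpark.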

Troubleshooting

| Issue | Solution |
|---|---|
| Slow generation | Use 4-bit CustomVoice model |
| Unnatural pauses | Add punctuation, keep sentences short |
| Wrong language detected | Specify --lang_code explicitly |
| Voice cloning quality | Use cleaner reference audio, accurate transcript |
| Tokenizer warnings | Harmless, can be ignored |
| Out of memory | Close other apps, use 4-bit model |