Trending-skills moss-tts-nano-speech

Expert skill for using MOSS-TTS-Nano, a 0.1B parameter multilingual real-time TTS model that runs on CPU with voice cloning support.

install
source · Clone the upstream repo
git clone https://github.com/Aradotso/trending-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/moss-tts-nano-speech" ~/.claude/skills/aradotso-trending-skills-moss-tts-nano-speech && rm -rf "$T"
manifest: skills/moss-tts-nano-speech/SKILL.md
source content

MOSS-TTS-Nano Speech Generation Skill

Skill by ara.so — Daily 2026 Skills collection.

MOSS-TTS-Nano is an open-source multilingual tiny TTS model (0.1B parameters) from MOSI.AI and the OpenMOSS team. It uses an Audio Tokenizer + LLM autoregressive pipeline to generate 48 kHz stereo speech in real time, supports 20 languages, voice cloning, streaming inference, and runs on CPU without a GPU.

Installation

Conda (recommended)

conda create -n moss-tts-nano python=3.12 -y
conda activate moss-tts-nano

git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
cd MOSS-TTS-Nano

pip install -r requirements.txt
pip install -e .

If the WeTextProcessing install fails

conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git

After the editable install (pip install -e .), the moss-tts-nano CLI command is available in the active environment.

Model Weights

Models are auto-downloaded from Hugging Face on first run:

  • TTS model: OpenMOSS-Team/MOSS-TTS-Nano
  • Audio tokenizer: OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano

ModelScope mirrors are available at openmoss/MOSS-TTS-Nano and openmoss/MOSS-Audio-Tokenizer-Nano.
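
To warm the local cache before the first run, a minimal sketch using huggingface_hub's snapshot_download (huggingface_hub itself is an assumption here; install it with pip if the requirements do not already pull it in):

from huggingface_hub import snapshot_download

# Pre-fetch both repos so the first inference run does not block on downloads.
snapshot_download("OpenMOSS-Team/MOSS-TTS-Nano")
snapshot_download("OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano")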

CLI Commands

Generate speech (voice clone mode)

moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"

Output defaults to generated_audio/moss_tts_nano_output.wav.

Generate from a text file (long-form)

moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text-file my_script.txt \
  --output output.wav

Launch local web demo

moss-tts-nano serve
# or directly:
python app.py

Opens at http://127.0.0.1:18083; the model stays loaded in memory for fast repeated requests.
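
Because the model stays resident, repeated requests should be noticeably faster than the first. A rough latency check against the /generate endpoint documented later in this file (the endpoint name and payload shape are taken from those examples):

import base64
import time

import requests

with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

payload = {"text": "Warm-up request.", "prompt_audio_base64": ref_audio_b64}

for label in ("first", "second"):
    t0 = time.perf_counter()
    requests.post("http://127.0.0.1:18083/generate", json=payload, timeout=300)
    # The first call may include warm-up work; later calls reuse the loaded model.
    print(f"{label} request: {time.perf_counter() - t0:.2f}s")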

Direct Python entrypoint

python infer.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "Hello, this is a test of MOSS-TTS-Nano."

Output: generated_audio/infer_output.wav

Python API Usage

Basic voice clone inference

from infer import MossTTSNanoInference

# Initialize once (downloads weights on first run)
tts = MossTTSNanoInference()

# Voice clone: synthesize text in the style of the reference audio
audio = tts.infer(
    text="欢迎使用MOSS语音合成系统。",
    prompt_audio_path="assets/audio/zh_1.wav",
)

# Save output
import soundfile as sf
sf.write("output.wav", audio, samplerate=48000)

English voice clone

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

audio = tts.infer(
    text="Welcome to MOSS TTS Nano, a tiny but capable text to speech model.",
    prompt_audio_path="assets/audio/en_sample.wav",
)

import soundfile as sf
sf.write("english_output.wav", audio, samplerate=48000)

Streaming inference (low latency)

from infer import MossTTSNanoInference
import soundfile as sf
import numpy as np

tts = MossTTSNanoInference()

chunks = []
for audio_chunk in tts.infer_stream(
    text="This sentence is generated chunk by chunk for low latency playback.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    chunks.append(audio_chunk)
    # process or play chunk in real time here

full_audio = np.concatenate(chunks)
sf.write("streamed_output.wav", full_audio, samplerate=48000)

Long-text synthesis with chunked voice cloning

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

long_text = """
MOSS-TTS-Nano supports long-form synthesis through automatic chunking.
Each chunk uses the same reference voice, producing consistent speaker identity
across the entire output even for multi-paragraph documents.
"""

audio = tts.infer(
    text=long_text,
    prompt_audio_path="assets/audio/en_sample.wav",
)

import soundfile as sf
sf.write("long_form_output.wav", audio, samplerate=48000)

FastAPI HTTP endpoint usage

When the server is running (moss-tts-nano serve or python app.py):

import requests
import base64
import soundfile as sf
import io

# Read reference audio as base64
with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://127.0.0.1:18083/generate",
    json={
        "text": "你好,这是一个语音合成测试。",
        "prompt_audio_base64": ref_audio_b64,
    },
)

data = response.json()
audio_bytes = base64.b64decode(data["audio_base64"])

audio_array, sr = sf.read(io.BytesIO(audio_bytes))
sf.write("api_output.wav", audio_array, samplerate=sr)

Streaming HTTP response (real-time web playback)

import requests
import base64

with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

with requests.post(
    "http://127.0.0.1:18083/generate_stream",
    json={
        "text": "流式语音合成示例,适合实时播放场景。",
        "prompt_audio_base64": ref_audio_b64,
    },
    stream=True,
) as resp:
    with open("stream_output.wav", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            out.write(chunk)

Supported Languages

| Code | Language | Code | Language  | Code | Language   |
|------|----------|------|-----------|------|------------|
| zh   | Chinese  | en   | English   | de   | German     |
| es   | Spanish  | fr   | French    | ja   | Japanese   |
| it   | Italian  | hu   | Hungarian | ko   | Korean     |
| ru   | Russian  | fa   | Persian   | ar   | Arabic     |
| pl   | Polish   | pt   | Portuguese| cs   | Czech      |
| da   | Danish   | sv   | Swedish   | el   | Greek      |
| tr   | Turkish  |      |           |      |            |

The language is inferred automatically from the input text and the reference audio. No explicit language code parameter is required for basic usage.
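
Since there is no language flag, switching languages is just a matter of changing the text. A minimal sketch reusing the infer call from the Python API section above (both languages appear in the table):

from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
ref = "assets/audio/en_sample.wav"

# Same call, different languages; the model detects each from the text itself.
for name, text in [
    ("german", "Willkommen bei MOSS TTS Nano."),
    ("french", "Bienvenue dans MOSS TTS Nano."),
]:
    audio = tts.infer(text=text, prompt_audio_path=ref)
    sf.write(f"{name}_output.wav", audio, samplerate=48000)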

Architecture Overview

  • Pipeline: Audio Tokenizer + LLM (pure autoregressive)
  • Audio Tokenizer: MOSS-Audio-Tokenizer-Nano (~20M params), CNN-free causal Transformer (Cat architecture)
  • Output: 48 kHz, 2-channel (stereo)
  • Token rate: 12.5 Hz token stream
  • Codebooks: RVQ with 16 codebooks (0.125 kbps – 2 kbps)
  • LLM: ~0.1B parameters total
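
These numbers are mutually consistent: at 12.5 Hz, the 0.125 kbps floor implies 10 bits per codebook (a 1024-entry codebook, inferred from the bitrates rather than stated above), and 16 codebooks give the 2 kbps ceiling. A quick arithmetic check:

TOKEN_RATE_HZ = 12.5     # token timesteps per second, from the spec above
BITS_PER_CODEBOOK = 10   # inferred: 0.125 kbps / 12.5 Hz = 10 bits -> 1024 entries
NUM_CODEBOOKS = 16

min_kbps = TOKEN_RATE_HZ * BITS_PER_CODEBOOK / 1000                  # one codebook
max_kbps = TOKEN_RATE_HZ * BITS_PER_CODEBOOK * NUM_CODEBOOKS / 1000  # all sixteen
print(min_kbps, max_kbps)  # 0.125 2.0

# A 10-second clip costs 125 timesteps x 16 codebooks = 2000 audio tokens.
print(int(10 * TOKEN_RATE_HZ * NUM_CODEBOOKS))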

Key CLI Flags

| Flag | Description |
|------|-------------|
| --prompt-audio-path | Path to reference WAV for voice cloning (infer.py) |
| --prompt-speech | Same purpose in the moss-tts-nano generate CLI |
| --text | Input text string |
| --text-file | Path to a plain text file for long-form synthesis |
| --output | Output WAV file path (default varies by entrypoint) |
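
For scripting pipelines, the generate subcommand can also be driven from Python via subprocess, using only the flags listed above:

import subprocess

subprocess.run(
    [
        "moss-tts-nano", "generate",
        "--prompt-speech", "assets/audio/zh_1.wav",
        "--text-file", "my_script.txt",
        "--output", "output.wav",
    ],
    check=True,  # raise CalledProcessError if synthesis fails
)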

Common Patterns

Pattern: Batch synthesis with one reference voice

from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
ref = "assets/audio/zh_1.wav"

sentences = [
    "第一句话,用于批量合成测试。",
    "第二句话,保持相同的音色。",
    "第三句话,输出独立的音频文件。",
]

for i, sentence in enumerate(sentences):
    audio = tts.infer(text=sentence, prompt_audio_path=ref)
    sf.write(f"output_{i:02d}.wav", audio, samplerate=48000)
    print(f"Saved output_{i:02d}.wav")

Pattern: Real-time playback with sounddevice

import sounddevice as sd
import numpy as np
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

buffer = []  # collect all chunks first; playback starts after synthesis completes
for chunk in tts.infer_stream(
    text="Real-time playback example using sounddevice.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    buffer.append(chunk)

audio = np.concatenate(buffer)
sd.play(audio, samplerate=48000)
sd.wait()
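
The pattern above buffers the whole clip before playback begins. For genuinely incremental playback, each chunk can be written into an open output stream instead; this sketch assumes each chunk arrives as a float array shaped (n_samples, 2), matching the model's 48 kHz stereo output:

import numpy as np
import sounddevice as sd
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

# write() blocks until the chunk is queued, so audio starts as soon as the
# first chunk is generated rather than after the full clip is synthesized.
with sd.OutputStream(samplerate=48000, channels=2, dtype="float32") as stream:
    for chunk in tts.infer_stream(
        text="Incremental playback example using sounddevice.",
        prompt_audio_path="assets/audio/en_sample.wav",
    ):
        stream.write(np.ascontiguousarray(chunk, dtype=np.float32))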

Pattern: Gradio integration

import gradio as gr
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

def synthesize(reference_audio_path: str, text: str):
    audio = tts.infer(text=text, prompt_audio_path=reference_audio_path)
    # Return as (sample_rate, numpy_array) tuple for Gradio Audio component
    return (48000, audio)

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Audio(type="filepath", label="Reference Voice"),
        gr.Textbox(label="Text to synthesize"),
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="MOSS-TTS-Nano Voice Clone",
)

demo.launch()

Troubleshooting

WeTextProcessing install fails

# Use conda to get pynini, then install from source
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git

Model download is slow or fails

Set HF_ENDPOINT to a mirror if Hugging Face is unreachable:

export HF_ENDPOINT=https://hf-mirror.com
python infer.py --prompt-audio-path assets/audio/zh_1.wav --text "测试"
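
The mirror can also be selected from Python, provided the variable is set before huggingface_hub is first imported (the endpoint is read at import time):

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # must come before any huggingface_hub import

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()  # weights now download through the mirror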

Or use ModelScope:

pip install modelscope

Then point the model paths to openmoss/MOSS-TTS-Nano and openmoss/MOSS-Audio-Tokenizer-Nano.
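
A sketch for pre-fetching the mirrors with ModelScope's snapshot_download; how the returned local paths are handed to the inference code depends on MOSS-TTS-Nano's own configuration, so treat that wiring as an assumption:

from modelscope import snapshot_download

tts_path = snapshot_download("openmoss/MOSS-TTS-Nano")
tokenizer_path = snapshot_download("openmoss/MOSS-Audio-Tokenizer-Nano")
print(tts_path, tokenizer_path)  # local directories containing the weights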

Out of memory on CPU

  • Use streaming inference (infer_stream) to reduce peak memory.
  • Reduce chunk size for long text inputs — the model handles chunked voice cloning automatically.
  • Close other applications; the model needs ~1–2 GB RAM.

Audio output is silent or corrupt

  • Ensure the reference WAV is a clean mono or stereo file, 16-bit or float32, at any sample rate (it will be resampled).
  • Minimum reference audio duration: ~3–5 seconds for reliable voice cloning (a quick validation sketch follows this list).
  • Avoid reference audio with heavy background noise.
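
A pre-flight check on the reference file with soundfile, using the thresholds above:

import soundfile as sf

def check_reference(path: str, min_seconds: float = 3.0) -> None:
    # sf.info reads the header only, so this is cheap even for long files.
    info = sf.info(path)
    duration = info.frames / info.samplerate
    print(f"{path}: {duration:.1f}s, {info.samplerate} Hz, {info.channels} ch, {info.subtype}")
    if duration < min_seconds:
        raise ValueError(f"reference is only {duration:.1f}s; use at least ~{min_seconds:.0f}s")
    if info.channels > 2:
        raise ValueError("use a mono or stereo reference file")

check_reference("assets/audio/zh_1.wav")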

moss-tts-nano command not found

# Re-run editable install inside the active conda env
pip install -e .
which moss-tts-nano   # should resolve now

Port conflict for web demo

# Default port is 18083; check what occupies it
lsof -i :18083
# Kill if needed, then relaunch
moss-tts-nano serve

Output Defaults

| Entrypoint | Default output path |
|------------|---------------------|
| python infer.py | generated_audio/infer_output.wav |
| moss-tts-nano generate | generated_audio/moss_tts_nano_output.wav |
| python app.py / moss-tts-nano serve | returned via HTTP response |

The generated_audio/ directory is created automatically if it does not exist.