happy-claude-skills / happy-audio-gen
Universal AI voice / text-to-speech skill supporting OpenAI TTS (gpt-4o-mini-tts, tts-1), ElevenLabs multilingual TTS with voice cloning, Bailian Qwen TTS (qwen-tts / qwen3-tts-vd with voice-design custom voices, long-text chunking built in), MiniMax speech-02-hd, SiliconFlow CosyVoice / SenseVoice, and PlayHT 2.0. Use this skill whenever the user asks to read text aloud, synthesize speech, generate narration, create voice-over, dub a script, or turn any text into audio (mp3 / wav / ogg / flac). Typical phrases include "read this aloud", "generate voice for ...", "create a narration of ...", "tts this", "把这段念出来", "做个配音", "合成语音", or mentions of voices / TTS model names like Alloy, Ash, Cherry, Rachel, CosyVoice, PlayHT. Always use this skill even if the user does not specify a provider — pick one from EXTEND.md defaults or available env keys.
```shell
git clone https://github.com/iamzhihuix/happy-claude-skills

# Or copy just this skill into Claude's skills directory:
T=$(mktemp -d) && git clone --depth=1 https://github.com/iamzhihuix/happy-claude-skills "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/happy-audio-gen" ~/.claude/skills/iamzhihuix-happy-claude-skills-happy-audio-gen \
  && rm -rf "$T"
```
skills/happy-audio-gen/SKILL.md
Turns text into speech across 6 providers through one CLI. All providers are synchronous (TTS is fast — typically under 10 seconds) except Bailian's voice-design flow (which is still covered but uses a longer poll window).
Quick usage
```shell
# Shortest path — OpenAI default voice
bun scripts/main.ts --text "Hello, world" --out ./hello.mp3

# Chinese, MiniMax
bun scripts/main.ts --provider minimax --text "大家好" --voice male-qn-qingse --out ./hello.mp3

# Long-form, Bailian (auto-splits by sentence)
bun scripts/main.ts --provider bailian --textfiles ./script.md --out ./narration.mp3
```
When to invoke this skill
- User asks to synthesize speech / TTS / read aloud / narrate / dub / make a voice-over.
- User asks to convert script / text / article into audio.
- User names a TTS voice or model.
Do not route here when the user wants to transcribe audio → text (that's STT, different domain), or edit / mix audio files (use a dedicated audio editor).
Step 0: Preflight (BLOCKING)
- Locate EXTEND.md:
  - `./.happy-skills/happy-audio-gen/EXTEND.md`
  - `$XDG_CONFIG_HOME/happy-skills/happy-audio-gen/EXTEND.md`
  - `~/.happy-skills/happy-audio-gen/EXTEND.md`

  If none is found, run `bun scripts/main.ts --setup` and walk the user through `references/config/first-time-setup.md`.
- Verify at least one provider has credentials (env var or 1Password reference).
- Verify Bun is available. Fallback: `npx -y bun`.
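The preflight steps above can be sketched as small shell helpers. This is a sketch only: the provider env-var names used below (`DASHSCOPE_API_KEY` for Bailian, and so on) are assumptions; the authoritative names live in `references/providers.md`.

```shell
# Locate EXTEND.md in the three documented search locations, first match wins.
find_extend() {
  for p in "./.happy-skills/happy-audio-gen/EXTEND.md" \
           "${XDG_CONFIG_HOME:-$HOME/.config}/happy-skills/happy-audio-gen/EXTEND.md" \
           "$HOME/.happy-skills/happy-audio-gen/EXTEND.md"; do
    if [ -f "$p" ]; then echo "$p"; return 0; fi
  done
  return 1   # caller should run: bun scripts/main.ts --setup
}

# At least one provider key must be present (env-var names are assumptions).
have_credentials() {
  [ -n "${OPENAI_API_KEY:-}${ELEVENLABS_API_KEY:-}${DASHSCOPE_API_KEY:-}${MINIMAX_API_KEY:-}${SILICONFLOW_API_KEY:-}${PLAYHT_API_KEY:-}" ]
}

# Prefer bun; fall back to npx as the skill documents.
BUN=$(command -v bun >/dev/null 2>&1 && echo "bun" || echo "npx -y bun")
```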
Step 1: Choose provider
Preference order:

1. `--provider <id>`
2. EXTEND.md `default_provider`
3. Auto-detect env vars: `openai > elevenlabs > bailian > minimax > siliconflow > playht`
Pick by language / voice intent:

- English, natural + fast → `openai` (gpt-4o-mini-tts / tts-1).
- Multilingual, voice cloning → `elevenlabs`.
- Chinese, long-form → `bailian` (qwen-tts auto-chunks long scripts) or `minimax`.
- Chinese dialect / voice design → `bailian` (voice-design with qwen3-tts-vd) or `siliconflow` (CosyVoice2).
- Ultra-realistic, short-form → `playht` (2.0).
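The auto-detect order can be expressed as a simple fallthrough. Again, the env-var names here are assumptions; check `references/providers.md` for the real ones.

```shell
# Return the first provider whose credentials are present, following the
# documented priority: openai > elevenlabs > bailian > minimax > siliconflow > playht.
detect_provider() {
  if   [ -n "${OPENAI_API_KEY:-}" ];      then echo openai
  elif [ -n "${ELEVENLABS_API_KEY:-}" ];  then echo elevenlabs
  elif [ -n "${DASHSCOPE_API_KEY:-}" ];   then echo bailian
  elif [ -n "${MINIMAX_API_KEY:-}" ];     then echo minimax
  elif [ -n "${SILICONFLOW_API_KEY:-}" ]; then echo siliconflow
  elif [ -n "${PLAYHT_API_KEY:-}" ];      then echo playht
  else echo none   # no credentials; trigger the setup flow instead
  fi
}
```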
Step 2: Fill parameters
- `--text` or `--textfiles`: input. Always quote.
- `--out <path>`: REQUIRED. Extension determines format (`.mp3` / `.wav` / `.ogg` / `.flac`).
- `--voice <id>`: provider-specific. See `references/voices.md` for the short list of well-known voices.
- `--rate 0.5..2.0`: speaking rate.
- `--instruction "..."`: voice direction (only `openai` gpt-4o-mini-tts and `siliconflow` honor this).
- `--language <code>`: `en`, `zh`, `ja` — only a few providers honor this explicitly.
Step 3: Run
```shell
bun scripts/main.ts \
  --provider openai \
  --model gpt-4o-mini-tts \
  --voice alloy \
  --text "..." \
  --out ./out.mp3
```
JSON mode:
```json
{
  "success": true,
  "provider": "openai",
  "model": "gpt-4o-mini-tts",
  "voice": "alloy",
  "output": "/abs/out.mp3",
  "size_bytes": 76032,
  "format": "mp3"
}
```
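A wrapper script can branch on that JSON. A minimal sketch, assuming `jq` is installed, with the sample payload inlined for illustration:

```shell
# Sample result inlined; in practice capture it from the CLI invocation.
result='{"success":true,"provider":"openai","output":"/abs/out.mp3","size_bytes":76032,"format":"mp3"}'

if [ "$(echo "$result" | jq -r '.success')" = "true" ]; then
  out=$(echo "$result" | jq -r '.output')
  echo "audio written to $out ($(echo "$result" | jq -r '.size_bytes') bytes)"
else
  echo "synthesis failed" >&2
fi
```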
Step 4: Long text handling
`happy-audio-gen` automatically splits long input for providers that cap per-call length (Bailian ≤ 200 Chinese chars per call). Chunks are concatenated byte-for-byte on output. For best fidelity with concatenated MP3s, stitch the segments with ffmpeg afterward rather than relying on byte concat.
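One way to do that stitch is ffmpeg's concat demuxer. The chunk filenames below are illustrative:

```shell
# Build the file list that ffmpeg's concat demuxer expects.
printf "file '%s'\n" part-001.mp3 part-002.mp3 part-003.mp3 > list.txt

# Lossless stitch (run where ffmpeg and the chunk files exist):
#   ffmpeg -f concat -safe 0 -i list.txt -c copy narration.mp3
```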
Step 5: Errors
- `[openai] OpenAI TTS 400` with `invalid voice` → the voice name is not supported by the model. Use one of `alloy`, `ash`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`.
- `[minimax] ... 2049 invalid api key` → try `MINIMAX_BASE_URL=https://api.minimaxi.com/v1` (different region).
- `[bailian] ... 400 DataInspectionFailed` → Aliyun content filter. Surface to the user.
- `[elevenlabs] 401` → key invalid or subscription expired.
References
- `references/providers.md` — per-provider env vars, default models, voice lists.
- `references/voices.md` — curated voices for each provider.
- `references/error_codes.md` — common errors and fixes.
- `references/config/first-time-setup.md`
- `references/config/extend-schema.md`
- `assets/EXTEND.template.md`