happy-claude-skills / happy-audio-gen
Universal AI voice / text-to-speech skill supporting OpenAI TTS (gpt-4o-mini-tts, tts-1), ElevenLabs multilingual TTS with voice cloning, Bailian Qwen TTS (qwen-tts / qwen3-tts-vd with voice-design custom voices, long-text chunking built in), MiniMax speech-02-hd, SiliconFlow CosyVoice / SenseVoice, and PlayHT 2.0. Use this skill whenever the user asks to read text aloud, synthesize speech, generate narration, create voice-over, dub a script, or turn any text into audio (mp3 / wav / ogg / flac). Typical phrases include "read this aloud", "generate voice for ...", "create a narration of ...", "tts this", "把这段念出来", "做个配音", "合成语音", or mentions of voices / TTS model names like Alloy, Ash, Cherry, Rachel, CosyVoice, PlayHT. Always use this skill even if the user does not specify a provider — pick one from EXTEND.md defaults or available env keys.
```shell
git clone https://github.com/iamzhihuix/happy-claude-skills

# Or copy just this skill into Claude's skills directory:
T=$(mktemp -d) && git clone --depth=1 https://github.com/iamzhihuix/happy-claude-skills "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/happy-audio-gen" ~/.claude/skills/iamzhihuix-happy-claude-skills-happy-audio-gen \
  && rm -rf "$T"
```
skills/happy-audio-gen/SKILL.md
Turns text into speech across 6 providers through one CLI. All providers are synchronous (TTS is fast — typically under 10 seconds) except Bailian's voice-design flow (which is still covered but uses a longer poll window).
Quick usage
```shell
# Shortest path — OpenAI default voice
bun scripts/main.ts --text "Hello, world" --out ./hello.mp3

# Chinese, MiniMax
bun scripts/main.ts --provider minimax --text "大家好" --voice male-qn-qingse --out ./hello.mp3

# Long-form, Bailian (auto-splits by sentence)
bun scripts/main.ts --provider bailian --textfiles ./script.md --out ./narration.mp3
```
When to invoke this skill
- User asks to synthesize speech / TTS / read aloud / narrate / dub / make a voice-over.
- User asks to convert script / text / article into audio.
- User names a TTS voice or model.
Do not route here when the user wants to transcribe audio → text (that's STT, different domain), or edit / mix audio files (use a dedicated audio editor).
Step 0: Preflight (BLOCKING)
- Locate EXTEND.md:
  - `./.happy-skills/happy-audio-gen/EXTEND.md`
  - `$XDG_CONFIG_HOME/happy-skills/happy-audio-gen/EXTEND.md`
  - `~/.happy-skills/happy-audio-gen/EXTEND.md`

  If none is found, run `bun scripts/main.ts --setup` and walk the user through `references/config/first-time-setup.md`.
- Verify at least one provider has credentials (env var or 1Password reference).
- Verify Bun is available. Fallback: `npx -y bun`.
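The preflight steps above can be sketched as small shell helpers. This is a sketch only: the provider env-var names used below (`DASHSCOPE_API_KEY` for Bailian, and so on) are assumptions; the authoritative names live in `references/providers.md`.

```shell
# Locate EXTEND.md in the three documented search locations, first match wins.
find_extend() {
  for p in "./.happy-skills/happy-audio-gen/EXTEND.md" \
           "${XDG_CONFIG_HOME:-$HOME/.config}/happy-skills/happy-audio-gen/EXTEND.md" \
           "$HOME/.happy-skills/happy-audio-gen/EXTEND.md"; do
    if [ -f "$p" ]; then echo "$p"; return 0; fi
  done
  return 1   # caller should run: bun scripts/main.ts --setup
}

# At least one provider key must be present (env-var names are assumptions).
have_credentials() {
  [ -n "${OPENAI_API_KEY:-}${ELEVENLABS_API_KEY:-}${DASHSCOPE_API_KEY:-}${MINIMAX_API_KEY:-}${SILICONFLOW_API_KEY:-}${PLAYHT_API_KEY:-}" ]
}

# Prefer bun; fall back to npx as the skill documents.
BUN=$(command -v bun >/dev/null 2>&1 && echo "bun" || echo "npx -y bun")
```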
Step 1: Choose provider
Preference order:

1. `--provider <id>`
2. EXTEND.md `default_provider`
3. Auto-detect env vars: `openai > elevenlabs > bailian > minimax > siliconflow > playht`
Pick by language / voice intent:

- English, natural + fast → `openai` (gpt-4o-mini-tts / tts-1).
- Multilingual, voice cloning → `elevenlabs`.
- Chinese, long-form → `bailian` (qwen-tts auto-chunks long scripts) or `minimax`.
- Chinese dialect / voice design → `bailian` (voice-design with qwen3-tts-vd) or `siliconflow` (CosyVoice2).
- Ultra-realistic, short-form → `playht` (2.0).
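The auto-detect order can be expressed as a simple fallthrough. Again, the env-var names here are assumptions; check `references/providers.md` for the real ones.

```shell
# Return the first provider whose credentials are present, following the
# documented priority: openai > elevenlabs > bailian > minimax > siliconflow > playht.
detect_provider() {
  if   [ -n "${OPENAI_API_KEY:-}" ];      then echo openai
  elif [ -n "${ELEVENLABS_API_KEY:-}" ];  then echo elevenlabs
  elif [ -n "${DASHSCOPE_API_KEY:-}" ];   then echo bailian
  elif [ -n "${MINIMAX_API_KEY:-}" ];     then echo minimax
  elif [ -n "${SILICONFLOW_API_KEY:-}" ]; then echo siliconflow
  elif [ -n "${PLAYHT_API_KEY:-}" ];      then echo playht
  else echo none   # no credentials; trigger the setup flow instead
  fi
}
```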
Step 2: Fill parameters
- `--text` or `--textfiles`: input. Always quote.
- `--out <path>`: REQUIRED. Extension determines format (`.mp3` / `.wav` / `.ogg` / `.flac`).
- `--voice <id>`: provider-specific. See `references/voices.md` for the short list of well-known voices.
- `--rate 0.5..2.0`: speaking rate.
- `--instruction "..."`: voice direction (only `openai` gpt-4o-mini-tts and `siliconflow` honor this).
- `--language <code>`: `en`, `zh`, `ja` — only a few providers honor this explicitly.
Step 3: Run
```shell
bun scripts/main.ts \
  --provider openai \
  --model gpt-4o-mini-tts \
  --voice alloy \
  --text "..." \
  --out ./out.mp3
```
JSON mode:
```json
{
  "success": true,
  "provider": "openai",
  "model": "gpt-4o-mini-tts",
  "voice": "alloy",
  "output": "/abs/out.mp3",
  "size_bytes": 76032,
  "format": "mp3"
}
```
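A wrapper script can branch on that JSON. A minimal sketch, assuming `jq` is installed, with the sample payload inlined for illustration:

```shell
# Sample result inlined; in practice capture it from the CLI invocation.
result='{"success":true,"provider":"openai","output":"/abs/out.mp3","size_bytes":76032,"format":"mp3"}'

if [ "$(echo "$result" | jq -r '.success')" = "true" ]; then
  out=$(echo "$result" | jq -r '.output')
  echo "audio written to $out ($(echo "$result" | jq -r '.size_bytes') bytes)"
else
  echo "synthesis failed" >&2
fi
```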
Step 4: Long text handling
`happy-audio-gen` automatically splits long input for providers that cap per-call length (Bailian ≤ 200 Chinese chars per call). Chunks are concatenated byte-for-byte on output. For best fidelity with concatenated MP3s, stitch the segments with ffmpeg afterward rather than relying on byte concat.
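One way to do that stitch is ffmpeg's concat demuxer. The chunk filenames below are illustrative:

```shell
# Build the file list that ffmpeg's concat demuxer expects.
printf "file '%s'\n" part-001.mp3 part-002.mp3 part-003.mp3 > list.txt

# Lossless stitch (run where ffmpeg and the chunk files exist):
#   ffmpeg -f concat -safe 0 -i list.txt -c copy narration.mp3
```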
Step 5: Errors
- `[openai] OpenAI TTS 400` with `invalid voice` → the voice name is not supported by the model. Use one of `alloy`, `ash`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`.
- `[minimax] ... 2049 invalid api key` → try `MINIMAX_BASE_URL=https://api.minimaxi.com/v1` (different region).
- `[bailian] ... 400 DataInspectionFailed` → Aliyun content filter. Surface to the user.
- `[elevenlabs] 401` → key invalid or subscription expired.
References
- `references/providers.md` — per-provider env vars, default models, voice lists.
- `references/voices.md` — curated voices for each provider.
- `references/error_codes.md` — common errors and fixes.
- `references/config/first-time-setup.md`
- `references/config/extend-schema.md`
- `assets/EXTEND.template.md`