OpenMontage text-to-speech
install
source · Clone the upstream repo
git clone https://github.com/calesthio/OpenMontage
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/calesthio/OpenMontage "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/text-to-speech" ~/.claude/skills/calesthio-openmontage-text-to-speech && rm -rf "$T"
manifest: .agents/skills/text-to-speech/SKILL.md
Text-to-Speech (HeyGen Starfish)
Generate speech audio files from text using HeyGen's in-house Starfish TTS model. This skill is for standalone audio generation — separate from video creation.
Authentication
All requests require the X-Api-Key header. Set the HEYGEN_API_KEY environment variable.
curl -X GET "https://api.heygen.com/v1/audio/voices" \
  -H "X-Api-Key: $HEYGEN_API_KEY"
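A minimal Python sketch of reading the key once and failing early when it is missing. The helper name `heygen_headers` is hypothetical and not part of any HeyGen SDK; only the header name and the environment variable come from the text above.

```python
import os


def heygen_headers(env=None) -> dict:
    """Build request headers from HEYGEN_API_KEY (hypothetical helper)."""
    # Accept a mapping for testing; default to the real environment.
    env = os.environ if env is None else env
    api_key = env.get("HEYGEN_API_KEY", "").strip()
    if not api_key:
        # Fail before any network call so the cause is obvious.
        raise RuntimeError("HEYGEN_API_KEY is not set")
    return {"X-Api-Key": api_key}
```

Passing the result as `headers=` to an HTTP client keeps the key out of individual call sites.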
Tool Selection
If HeyGen MCP tools are available (mcp__heygen__*), prefer them over direct HTTP API calls.
| Task | MCP Tool | Fallback (Direct API) |
|---|---|---|
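| List TTS voices | mcp__heygen__list_audio_voices | GET /v1/audio/voices |
| Generate speech audio | mcp__heygen__text_to_speech | POST /v1/audio/text_to_speech |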
Default Workflow
- List voices with mcp__heygen__list_audio_voices (or GET /v1/audio/voices)
- Pick a voice matching the desired language, gender, and features
- Call mcp__heygen__text_to_speech (or POST /v1/audio/text_to_speech) with text and voice_id
- Use the returned audio_url to download or play the audio
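The last step of the workflow above can be sketched in Python as follows. This assumes the returned audio_url can be fetched with a plain unauthenticated GET, which the source does not confirm; `download_audio` is a hypothetical helper name.

```python
import shutil
import urllib.request
from pathlib import Path


def download_audio(audio_url: str, dest: str) -> Path:
    """Stream the file at audio_url to dest and return the written path."""
    target = Path(dest)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so long recordings are never held fully in memory.
    with urllib.request.urlopen(audio_url, timeout=60) as response, \
            open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

The standard library is used here so the sketch has no extra dependencies; the same logic works with any HTTP client that supports streaming.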
List TTS Voices
Retrieve voices compatible with the Starfish TTS model.
Note: This uses GET /v1/audio/voices, a different endpoint from the video voices API (GET /v2/voices). Not all video voices support Starfish TTS.
curl
curl -X GET "https://api.heygen.com/v1/audio/voices" \
  -H "X-Api-Key: $HEYGEN_API_KEY"
TypeScript
interface TTSVoice {
  voice_id: string;
  language: string;
  gender: "female" | "male" | "unknown";
  name: string;
  preview_audio_url: string | null;
  support_pause: boolean;
  support_locale: boolean;
  type: string;
}

interface TTSVoicesResponse {
  error: null | string;
  data: {
    voices: TTSVoice[];
  };
}

// Fetch every voice compatible with the Starfish TTS model.
async function listTTSVoices(): Promise<TTSVoice[]> {
  const response = await fetch("https://api.heygen.com/v1/audio/voices", {
    headers: { "X-Api-Key": process.env.HEYGEN_API_KEY! },
  });

  const json: TTSVoicesResponse = await response.json();

  if (json.error) {
    throw new Error(json.error);
  }

  return json.data.voices;
}
Python
import os

import requests


def list_tts_voices() -> list:
    """Return the voices compatible with the Starfish TTS model."""
    response = requests.get(
        "https://api.heygen.com/v1/audio/voices",
        headers={"X-Api-Key": os.environ["HEYGEN_API_KEY"]},
    )

    data = response.json()
    if data.get("error"):
        raise Exception(data["error"])

    return data["data"]["voices"]
Response Format
{
  "error": null,
  "data": {
    "voices": [
      {
        "voice_id": "f38a635bee7a4d1f9b0a654a31d050d2",
        "name": "Chill Brian",
        "language": "English",
        "gender": "male",
        "preview_audio_url": "https://resource.heygen.ai/text_to_speech/WpSDQvmLGXEqXZVZQiVeg6.mp3",
        "support_pause": true,
        "support_locale": false,
        "type": "public"
      }
    ]
  }
}
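Given a voices list shaped like the response above, choosing a voice comes down to filtering on its fields. The sketch below is one possible approach; `pick_voice` and its parameters are hypothetical names, and matching the language by substring is an assumption about how the language field is populated.

```python
def pick_voice(voices, language, gender=None, need_pause=False, need_locale=False):
    """Return the first voice matching the filters, or None if none match."""
    wanted = language.lower()
    for voice in voices:
        # Substring match so "english" also matches "English".
        if wanted not in voice.get("language", "").lower():
            continue
        if gender and voice.get("gender") != gender:
            continue
        if need_pause and not voice.get("support_pause"):
            continue
        if need_locale and not voice.get("support_locale"):
            continue
        return voice
    return None
```

With the sample response above, `pick_voice(voices, "english", gender="male")` would return the "Chill Brian" entry, while adding `need_locale=True` would return `None` because that voice has support_locale set to false.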
Generate Speech Audio
Convert text to speech audio using a specified voice.
Endpoint
POST https://api.heygen.com/v1/audio/text_to_speech
Request Fields
| Field | Type | Req | Description |
|---|---|---|---|
| text | string | Y | Text content to convert to speech |
| voice_id | string | Y | Voice ID from GET /v1/audio/voices |
| speed | number | N | Speech speed, 0.5-1.5 (default: 1) |
| pitch | integer | N | Voice pitch, -50 to 50 (default: 0) |
| locale | string | N | Accent/locale for multilingual voices (e.g., pt-BR) |
| elevenlabs_settings | object | N | Advanced settings for ElevenLabs voices |
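The ranges in the table above can be checked on the client before a request is sent. This is a minimal sketch, assuming the documented ranges are inclusive; `build_tts_payload` is a hypothetical helper, and the API may apply its own validation as well.

```python
def build_tts_payload(text, voice_id, speed=1.0, pitch=0, locale=None):
    """Validate fields against the documented ranges and return a request body."""
    if not text or not text.strip():
        raise ValueError("text must not be empty")
    if not voice_id:
        raise ValueError("voice_id is required")
    if not 0.5 <= speed <= 1.5:
        raise ValueError("speed must be between 0.5 and 1.5")
    # pitch is documented as an integer; reject floats and booleans.
    if isinstance(pitch, bool) or not isinstance(pitch, int):
        raise ValueError("pitch must be an integer")
    if not -50 <= pitch <= 50:
        raise ValueError("pitch must be between -50 and 50")

    payload = {"text": text, "voice_id": voice_id, "speed": speed, "pitch": pitch}
    if locale:
        # Only send locale when the caller supplied one.
        payload["locale"] = locale
    return payload
```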
ElevenLabs Settings (optional)
| Field | Type | Description |
|---|---|---|
| model | string | Model selection |
| similarity_boost | number | Voice similarity, 0.0-1.0 |
| stability | number | Output consistency, 0.0-1.0 |
| style | number | Style intensity, 0.0-1.0 |
curl
curl -X POST "https://api.heygen.com/v1/audio/text_to_speech" \
  -H "X-Api-Key: $HEYGEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello! Welcome to our product demo.",
    "voice_id": "YOUR_VOICE_ID",
    "speed": 1.0
  }'
TypeScript
interface TTSRequest {
  text: string;
  voice_id: string;
  speed?: number;
  pitch?: number;
  locale?: string;
  elevenlabs_settings?: {
    model?: string;
    similarity_boost?: number;
    stability?: number;
    style?: number;
  };
}

interface WordTimestamp {
  word: string;
  start: number;
  end: number;
}

interface TTSResponse {
  error: null | string;
  data: {
    audio_url: string;
    duration: number;
    request_id: string;
    word_timestamps: WordTimestamp[];
  };
}

// Send the request and return the data object (audio URL, duration, timestamps).
async function textToSpeech(request: TTSRequest): Promise<TTSResponse["data"]> {
  const response = await fetch(
    "https://api.heygen.com/v1/audio/text_to_speech",
    {
      method: "POST",
      headers: {
        "X-Api-Key": process.env.HEYGEN_API_KEY!,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(request),
    }
  );

  const json: TTSResponse = await response.json();

  if (json.error) {
    throw new Error(json.error);
  }

  return json.data;
}
Python
import os

import requests


def text_to_speech(
    text: str,
    voice_id: str,
    speed: float = 1.0,
    pitch: int = 0,
    locale: str | None = None,
) -> dict:
    """Generate speech and return the response's data object."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "speed": speed,
        "pitch": pitch,
    }

    # Only send locale when the caller supplied one.
    if locale:
        payload["locale"] = locale

    response = requests.post(
        "https://api.heygen.com/v1/audio/text_to_speech",
        headers={
            "X-Api-Key": os.environ["HEYGEN_API_KEY"],
            "Content-Type": "application/json",
        },
        json=payload,
    )

    data = response.json()
    if data.get("error"):
        raise Exception(data["error"])

    return data["data"]
Response Format
{
  "error": null,
  "data": {
    "audio_url": "https://resource2.heygen.ai/text_to_speech/.../id=365d46bb.wav",
    "duration": 5.526,
    "request_id": "p38QJ52hfgNlsYKZZmd9",
    "word_timestamps": [
      { "word": "<start>", "start": 0.0, "end": 0.0 },
      { "word": "Hey", "start": 0.079, "end": 0.219 },
      { "word": "there,", "start": 0.239, "end": 0.459 },
      { "word": "<end>", "start": 5.526, "end": 5.526 }
    ]
  }
}
Usage Examples
Basic TTS
const result = await textToSpeech({
  text: "Welcome to our quarterly earnings call.",
  voice_id: "YOUR_VOICE_ID",
});

console.log(`Audio URL: ${result.audio_url}`);
console.log(`Duration: ${result.duration}s`);
With Speed Adjustment
const result = await textToSpeech({
  text: "We're thrilled to announce our newest feature!",
  voice_id: "YOUR_VOICE_ID",
  speed: 1.1,
});
With Locale for Multilingual Voices
const result = await textToSpeech({
  text: "Bem-vindo ao nosso produto.",
  voice_id: "MULTILINGUAL_VOICE_ID",
  locale: "pt-BR",
});
Find a Voice and Generate Audio
async function generateSpeech(text: string, language: string): Promise<string> {
  const voices = await listTTSVoices();

  // Take the first voice whose language contains the requested name.
  const voice = voices.find(
    (v) => v.language.toLowerCase().includes(language.toLowerCase())
  );

  if (!voice) {
    throw new Error(`No TTS voice found for language: ${language}`);
  }

  const result = await textToSpeech({
    text,
    voice_id: voice.voice_id,
  });

  return result.audio_url;
}

const audioUrl = await generateSpeech("Hello and welcome!", "english");
Pauses with Break Tags
Use SSML-style break tags in your text for pauses:
word <break time="1s"/> word
Rules:
- Use seconds with the s suffix: <break time="1.5s"/>
- Must have spaces before and after the tag
- Self-closing tag format
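The rules above are easy to get wrong when concatenating strings by hand, so a small helper can enforce them. This is a sketch with hypothetical function names; it only formats text and makes no claim about which voices honor the tag.

```python
def break_tag(seconds: float) -> str:
    """Return a self-closing break tag with the duration in seconds."""
    if seconds <= 0:
        raise ValueError("pause duration must be positive")
    # The g format drops trailing zeros: 1.0 becomes "1", 1.5 stays "1.5".
    return f'<break time="{seconds:g}s"/>'


def join_with_pauses(segments, seconds):
    """Join text segments with a break tag, keeping a space on each side."""
    tag = break_tag(seconds)
    parts = [s.strip() for s in segments if s.strip()]
    return f" {tag} ".join(parts)
```

For example, `join_with_pauses(["word", "word"], 1)` produces `word <break time="1s"/> word`.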
Best Practices
- Use GET /v1/audio/voices to find compatible voices; not all voices from GET /v2/voices support Starfish TTS
- Check support_locale before setting a locale; only multilingual voices support locale selection
- Keep speed between 0.8-1.2 for natural-sounding output
- Preview voices using the preview_audio_url before generating (may be null for some voices)
- Use word_timestamps in the response for caption syncing or timed text overlays
- Use SSML break tags in your text for pauses: word <break time="1s"/> word
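Two of the practices above, the natural speed range and the support_locale check, can be applied mechanically. The following is a sketch under those assumptions; `natural_settings` is a hypothetical helper, and silently dropping an unsupported locale is a design choice rather than documented API behavior.

```python
def natural_settings(voice, speed=1.0, locale=None):
    """Clamp speed to 0.8-1.2 and keep locale only if the voice supports it."""
    settings = {
        "voice_id": voice["voice_id"],
        # Clamp rather than reject, so callers always get a usable value.
        "speed": min(1.2, max(0.8, speed)),
    }
    if locale and voice.get("support_locale"):
        settings["locale"] = locale
    return settings
```

The returned dictionary can be merged with the text field to form a request body. Raising an error instead of dropping the locale may be preferable when the accent matters to the caller.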