ai-provider-elevenlabs
ElevenLabs voice AI SDK patterns for TypeScript/Node.js -- text-to-speech, streaming, voice cloning, speech-to-speech, pronunciation control, and conversational AI
```bash
git clone https://github.com/agents-inc/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/agents-inc/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/src/skills/ai-provider-elevenlabs" ~/.claude/skills/agents-inc-skills-ai-provider-elevenlabs-90058d && rm -rf "$T"
```
src/skills/ai-provider-elevenlabs/SKILL.md

ElevenLabs Patterns
Quick Guide: Use the official `@elevenlabs/elevenlabs-js` package to interact with the ElevenLabs API. Use `client.textToSpeech.convert()` for full audio generation or `client.textToSpeech.stream()` for low-latency streaming. Voice settings (`stability`, `similarityBoost`, `style`) control output character. Use `eleven_v3` for best quality, `eleven_flash_v2_5` for lowest latency, or `eleven_multilingual_v2` for stable long-form content. The SDK returns `ReadableStream<Uint8Array>` -- pipe to files or HTTP responses. Use `@elevenlabs/client` for real-time conversational AI agents.
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST use `@elevenlabs/elevenlabs-js` for server-side TTS, voice management, and speech-to-speech -- use `@elevenlabs/client` only for conversational AI agents)
(You MUST never hardcode API keys -- always use environment variables via `process.env.ELEVENLABS_API_KEY`, which the SDK reads automatically)
(You MUST consume the `ReadableStream<Uint8Array>` returned by `convert()` and `stream()` -- unconsumed streams leak resources)
(You MUST choose the correct model for your use case -- `eleven_v3` for quality, `eleven_flash_v2_5` for speed, `eleven_multilingual_v2` for long-form stability)
(You MUST pass `voiceId` as the first positional argument to all `textToSpeech` methods -- it is NOT inside the options object)
</critical_requirements>
Auto-detection: ElevenLabs, elevenlabs, ElevenLabsClient, textToSpeech.convert, textToSpeech.stream, eleven_multilingual_v2, eleven_flash_v2_5, eleven_v3, speechToSpeech, voices.search, voice cloning, ELEVENLABS_API_KEY, @elevenlabs/elevenlabs-js, @elevenlabs/client, text-to-speech, TTS, voice synthesis
When to use:
- Generating speech audio from text (narration, audiobooks, announcements)
- Streaming audio in real-time for low-latency playback
- Cloning voices from audio samples (instant or professional voice cloning)
- Converting speech from one voice to another (speech-to-speech)
- Building real-time conversational AI agents with voice interaction
- Controlling pronunciation with SSML or pronunciation dictionaries
- Generating audio with character-level timestamp alignment
Key patterns covered:
- Client initialization and configuration (retries, timeouts, API key)
- Text-to-speech conversion and streaming (`convert`, `stream`, timestamps)
- Voice settings (`stability`, `similarityBoost`, `style`, `speed`)
- Voice selection and management (`voices.search`, `voices.get`)
- Voice cloning (instant via `voices.ivc.create`)
- Speech-to-speech voice conversion
- WebSocket input streaming for real-time text-to-speech
- Pronunciation dictionaries and SSML
- Conversational AI agents (`@elevenlabs/client`)
- Model selection, output formats, error handling
When NOT to use:
- You need multi-provider voice AI (multiple TTS vendors) -- use a unified abstraction
- You only need browser-side audio playback without generation -- use the Web Audio API
- You need speech-to-text transcription only -- ElevenLabs has this, but it is a separate concern
Examples Index
- Core: Setup, TTS, Streaming & Voice Settings -- Client init, convert, stream, timestamps, voice settings, output formats
- Voices & Cloning -- Voice search, selection, instant voice cloning, speech-to-speech
- WebSocket & Conversational AI -- WebSocket input streaming, conversational AI agents, real-time patterns
- Quick API Reference -- Model IDs, method signatures, output formats, error types, voice settings
<philosophy>
Philosophy
The ElevenLabs SDK provides direct access to the most advanced voice AI API available. It wraps the ElevenLabs REST API with full TypeScript types, streaming support, and automatic retries.
Core principles:
- Streams everywhere -- All audio methods return `ReadableStream<Uint8Array>`. You pipe them to files, HTTP responses, or audio players. The SDK never buffers entire audio files in memory.
- Voice settings are the primary control surface -- `stability`, `similarityBoost`, `style`, and `speed` shape every generation. Learn these four knobs well.
- Model selection drives the quality/latency tradeoff -- `eleven_v3` for best quality, `eleven_flash_v2_5` for sub-75ms latency, `eleven_multilingual_v2` for stable long-form.
- Two packages for two use cases -- `@elevenlabs/elevenlabs-js` for server-side TTS/voice management, `@elevenlabs/client` for browser-side conversational AI agents.
- Built-in resilience -- The SDK retries on 408, 409, 429, and 5xx errors (2 retries by default) with configurable timeouts.
</philosophy>
<patterns>
Core Patterns
Pattern 1: Client Setup
Initialize the ElevenLabs client. It auto-reads `ELEVENLABS_API_KEY` from the environment.
```typescript
// lib/elevenlabs.ts -- basic setup
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const client = new ElevenLabsClient();

export { client };
```
```typescript
// lib/elevenlabs.ts -- production configuration
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const TIMEOUT_SECONDS = 60;
const MAX_RETRIES = 3;

const client = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
  timeoutInSeconds: TIMEOUT_SECONDS,
  maxRetries: MAX_RETRIES,
});

export { client };
```
Why good: Minimal setup, env var auto-detected, named constants for production settings
```typescript
// BAD: Hardcoded API key
const client = new ElevenLabsClient({
  apiKey: "sk-1234567890abcdef",
});
```
Why bad: A hardcoded API key is a security breach risk and will leak into version control
See: examples/core.md for per-request overrides, error handling
Pattern 2: Text-to-Speech (Convert)
Generate complete audio from text. Returns `ReadableStream<Uint8Array>`.
```typescript
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";

const VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"; // George

const audio = await client.textToSpeech.convert(VOICE_ID, {
  text: "Welcome to the application.",
  modelId: "eleven_multilingual_v2",
  outputFormat: "mp3_44100_128",
});

// Pipe to file
const readable = Readable.fromWeb(audio);
const fileStream = createWriteStream("output.mp3");
readable.pipe(fileStream);
```
Why good: `voiceId` as first arg (required), model and format explicit, stream piped to file without buffering
```typescript
// BAD: voiceId inside options object
const audio = await client.textToSpeech.convert({
  voiceId: VOICE_ID, // WRONG: voiceId is a positional argument
  text: "Hello",
});
```
Why bad: `voiceId` is the first positional argument, not an options field -- this will throw a type error
See: examples/core.md for timestamps, HTTP response piping
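A minimal sketch of piping `convert()` output to an HTTP response, assuming the `client` module from Pattern 1 and a plain `node:http` server (the route shape and query parameter are illustrative, not part of the SDK):

```typescript
import { createServer } from "node:http";
import { Readable } from "node:stream";

import { client } from "./lib/elevenlabs";

const VOICE_ID = "JBFqnCBsd6RMkjVDRZzb";

// Hypothetical route: GET /speech?text=... returns MP3 audio
const server = createServer(async (req, res) => {
  const url = new URL(req.url ?? "/", "http://localhost");
  const text = url.searchParams.get("text") ?? "Hello.";

  const audio = await client.textToSpeech.convert(VOICE_ID, {
    text,
    modelId: "eleven_multilingual_v2",
    outputFormat: "mp3_44100_128",
  });

  // Stream audio to the client without buffering the whole file in memory
  res.writeHead(200, { "Content-Type": "audio/mpeg" });
  Readable.fromWeb(audio).pipe(res);
});

server.listen(3000);
```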
Pattern 3: Text-to-Speech (Stream)
Stream audio for real-time playback with lower latency than `convert()`.
```typescript
const VOICE_ID = "JBFqnCBsd6RMkjVDRZzb";
const LATENCY_OPTIMIZATION = 2;

const audioStream = await client.textToSpeech.stream(VOICE_ID, {
  text: "This streams with lower latency for real-time playback.",
  modelId: "eleven_flash_v2_5",
  optimizeStreamingLatency: LATENCY_OPTIMIZATION,
  outputFormat: "mp3_44100_128",
});

// Consume the stream
for await (const chunk of audioStream) {
  process.stdout.write(chunk); // Or pipe to audio player / HTTP response
}
```
Why good: Uses `stream()` for lower latency, `eleven_flash_v2_5` for speed, `optimizeStreamingLatency` reduces first-byte time
```typescript
// BAD: Stream created but never consumed
const audioStream = await client.textToSpeech.stream(VOICE_ID, {
  text: "This audio is lost",
  modelId: "eleven_flash_v2_5",
});
// Stream never consumed -- resources leaked
```
Why bad: Unconsumed streams leak resources and the audio data is silently lost
See: examples/core.md for streaming to HTTP responses
Pattern 4: Voice Settings
Control voice characteristics with `voiceSettings`.
```typescript
const VOICE_ID = "JBFqnCBsd6RMkjVDRZzb";

const audio = await client.textToSpeech.convert(VOICE_ID, {
  text: "Emotional and expressive delivery.",
  modelId: "eleven_v3",
  voiceSettings: {
    stability: 0.3,        // Lower = more expressive/variable
    similarityBoost: 0.8,  // Higher = closer to original voice
    style: 0.5,            // Higher = more style exaggeration
    useSpeakerBoost: true, // Enhanced speaker similarity (adds latency)
    speed: 1.0,            // 0.7-1.3 range typical
  },
});
```
Why good: All settings explicit with clear purpose, stability lowered for expressive content
```typescript
// BAD: Using extreme values without understanding
const audio = await client.textToSpeech.convert(VOICE_ID, {
  text: "Extreme settings cause artifacts.",
  modelId: "eleven_v3",
  voiceSettings: {
    stability: 0.0,       // Too unstable -- garbled output
    similarityBoost: 1.0, // Combined with low stability = artifacts
    style: 1.0,           // Maximum exaggeration -- unnatural
  },
});
```
Why bad: Extreme values produce artifacts; `stability: 0.0` with high `similarityBoost` is unstable. Start with defaults and adjust incrementally.
See: reference.md for voice settings ranges and recommended starting values
Pattern 5: Voice Selection and Management
Find and select voices from the ElevenLabs voice library.
```typescript
// Search all available voices
const { voices } = await client.voices.search();

for (const voice of voices) {
  console.log(`${voice.name} (${voice.voiceId}) - ${voice.category}`);
}

// Get a specific voice by ID
const VOICE_ID = "JBFqnCBsd6RMkjVDRZzb";
const voice = await client.voices.get(VOICE_ID);
console.log(voice.name, voice.settings);
```
Why good: Uses `voices.search()` to discover available voices, `voices.get()` for details
See: examples/voices.md for filtering, voice cloning, speech-to-speech
Pattern 6: Voice Cloning (Instant)
Create an instant voice clone from audio samples.
```typescript
import { createReadStream } from "node:fs";

const voice = await client.voices.ivc.create({
  name: "My Custom Voice",
  files: [createReadStream("sample1.mp3"), createReadStream("sample2.mp3")],
  removeBackgroundNoise: true,
});

console.log(`Created voice: ${voice.voiceId}`);

// Use the cloned voice for TTS
const audio = await client.textToSpeech.convert(voice.voiceId, {
  text: "Speaking in the cloned voice.",
  modelId: "eleven_multilingual_v2",
});
```
Why good: `removeBackgroundNoise` improves quality, multiple samples improve accuracy, immediately usable
See: examples/voices.md for professional voice cloning, sample validation
Pattern 7: Speech-to-Speech
Convert speech from one voice to another while preserving emotion and cadence.
```typescript
import { createReadStream } from "node:fs";

const TARGET_VOICE_ID = "JBFqnCBsd6RMkjVDRZzb";

const convertedAudio = await client.speechToSpeech.convert(TARGET_VOICE_ID, {
  audio: createReadStream("input-speech.mp3"),
  modelId: "eleven_multilingual_sts_v2",
  voiceSettings: {
    stability: 0.5,
    similarityBoost: 0.75,
  },
});
```
Why good: Uses STS-specific model, preserves source emotion, voice settings control output fidelity
See: examples/voices.md for streaming STS, English-only model
Pattern 8: Error Handling
Catch SDK errors and handle specific failure modes.
```typescript
import {
  ElevenLabsError,
  ElevenLabsTimeoutError,
} from "@elevenlabs/elevenlabs-js";

const VOICE_ID = "JBFqnCBsd6RMkjVDRZzb";

try {
  const audio = await client.textToSpeech.convert(VOICE_ID, {
    text: "Hello, world.",
    modelId: "eleven_multilingual_v2",
  });
} catch (error) {
  if (error instanceof ElevenLabsTimeoutError) {
    console.error("Request timed out -- increase timeoutInSeconds or retry");
  } else if (error instanceof ElevenLabsError) {
    console.error(`ElevenLabs API error: ${error.message}`);
    console.error(`Status: ${error.statusCode}`);
    console.error(`Body: ${JSON.stringify(error.body)}`);
  } else {
    throw error; // Re-throw non-ElevenLabs errors
  }
}
```
Why good: Catches specific error types, logs status code and body for debugging, re-throws unknown errors
See: examples/core.md for stream error handling, retry patterns
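A minimal sketch of error handling while consuming a stream -- failures can surface mid-iteration, after the initial await has already succeeded, so consumption belongs inside the try (the buffering and fallback here are illustrative):

```typescript
import { ElevenLabsError } from "@elevenlabs/elevenlabs-js";

import { client } from "./lib/elevenlabs";

const VOICE_ID = "JBFqnCBsd6RMkjVDRZzb";

const chunks: Uint8Array[] = [];

try {
  const audioStream = await client.textToSpeech.stream(VOICE_ID, {
    text: "Streaming with error handling.",
    modelId: "eleven_flash_v2_5",
  });

  // Errors can be thrown while iterating, so keep consumption in the try block
  for await (const chunk of audioStream) {
    chunks.push(chunk);
  }
} catch (error) {
  if (error instanceof ElevenLabsError) {
    console.error(`Stream failed: ${error.message}`);
    // Illustrative fallback: retry the request, or serve cached audio instead
  } else {
    throw error;
  }
}
```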
</patterns>
<performance>
Performance Optimization
Model Selection for Latency/Quality
- Best quality + expressiveness -> `eleven_v3` (70+ languages)
- Long-form stability -> `eleven_multilingual_v2` (29 languages, 10K char limit)
- Lowest latency (<75ms) -> `eleven_flash_v2_5` (32 languages, 40K char limit)
- English-only low latency -> `eleven_flash_v2` (English only, 30K char limit)
- Voice design from text prompt -> `eleven_ttv_v3` (70+ languages)
Key Optimization Patterns
- Use `stream()` instead of `convert()` for user-facing audio -- playback starts before generation completes
- Set `optimizeStreamingLatency` (0-4) on `stream()` calls -- higher values reduce latency but may affect text normalization
- Use `eleven_flash_v2_5` for real-time applications -- sub-75ms latency at 50% lower cost
- Use `previous_request_ids` for multi-part generation -- maintains voice consistency across segments
- Batch multiple short texts into single requests when possible -- reduces API call overhead
- Cache generated audio for static content -- avoid re-generating identical text (see the sketch after this list)
- Use `outputFormat: "pcm_16000"` for server-side processing pipelines -- lower bandwidth than MP3
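A minimal caching sketch, assuming a local disk cache keyed by a hash of voice, model, and text (the cache directory and `cachedConvert` helper are illustrative names, not part of the SDK):

```typescript
import { createHash } from "node:crypto";
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { join } from "node:path";

import { client } from "./lib/elevenlabs";

const CACHE_DIR = ".audio-cache"; // Illustrative location

const cachedConvert = async (
  voiceId: string,
  text: string,
  modelId: string,
): Promise<Buffer> => {
  const key = createHash("sha256").update(`${voiceId}:${modelId}:${text}`).digest("hex");
  const cachePath = join(CACHE_DIR, `${key}.mp3`);

  // Serve from cache when identical text was generated before
  try {
    return await readFile(cachePath);
  } catch {
    // Cache miss -- fall through to generation
  }

  const audio = await client.textToSpeech.convert(voiceId, { text, modelId });

  // Buffer the stream so it can be written to the cache and returned
  const chunks: Uint8Array[] = [];
  for await (const chunk of audio) {
    chunks.push(chunk);
  }
  const buffer = Buffer.concat(chunks);

  await mkdir(CACHE_DIR, { recursive: true });
  await writeFile(cachePath, buffer);

  return buffer;
};

export { cachedConvert };
```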
</performance>
<decision_framework>
Decision Framework
Which Model to Choose
```
What is your priority?
+-- Best quality / expressiveness -> eleven_v3
+-- Lowest latency (<75ms) -> eleven_flash_v2_5
+-- Long-form stability (audiobooks) -> eleven_multilingual_v2
+-- English-only speed -> eleven_flash_v2
+-- Voice design from text description -> eleven_ttv_v3
+-- Speech-to-speech conversion -> eleven_multilingual_sts_v2 (or eleven_english_sts_v2)
```
convert() vs stream()
```
Is the audio user-facing with real-time playback?
+-- YES -> Use stream() for progressive playback
|   +-- Need timestamps? -> streamWithTimestamps()
+-- NO -> Use convert() for complete audio
    +-- Need timestamps? -> convertWithTimestamps()
    +-- Saving to file? -> convert() and pipe to WriteStream
```
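A minimal sketch of the timestamps variant, assuming the `{ audioBase64, alignment }` shape described in the gotchas below (the exact alignment field names are an assumption based on the SDK's camelCase conventions -- verify against the package types):

```typescript
import { writeFile } from "node:fs/promises";

import { client } from "./lib/elevenlabs";

const VOICE_ID = "JBFqnCBsd6RMkjVDRZzb";

const result = await client.textToSpeech.convertWithTimestamps(VOICE_ID, {
  text: "Timed narration.",
  modelId: "eleven_multilingual_v2",
});

// The whole clip arrives base64-encoded, NOT as a stream
await writeFile("timed.mp3", Buffer.from(result.audioBase64 ?? "", "base64"));

// Assumed alignment shape: per-character start/end times in seconds
console.log(result.alignment?.characters);
console.log(result.alignment?.characterStartTimesSeconds);
```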
Which Package to Use
```
What are you building?
+-- Server-side TTS, voice management, STS -> @elevenlabs/elevenlabs-js
+-- Browser conversational AI agent -> @elevenlabs/client
+-- React conversational AI agent -> @elevenlabs/react
+-- WebSocket text input streaming -> @elevenlabs/elevenlabs-js (or raw WebSocket)
```
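A minimal browser-side sketch with `@elevenlabs/client`, assuming a `Conversation.startSession()` entry point and a pre-configured agent (the agent ID and callback names here are illustrative -- check the package's current API before relying on them):

```typescript
import { Conversation } from "@elevenlabs/client";

// Illustrative agent ID from the ElevenLabs dashboard
const AGENT_ID = "your-agent-id";

// Starts a real-time voice session; the browser will prompt for mic access
const conversation = await Conversation.startSession({
  agentId: AGENT_ID,
  onConnect: () => console.log("Agent connected"),
  onDisconnect: () => console.log("Agent disconnected"),
  onError: (error) => console.error("Agent error:", error),
});

// Later, when the user ends the call
await conversation.endSession();
```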
Output Format Selection
```
What is the audio destination?
+-- Web browser playback -> mp3_44100_128 (universal compatibility)
+-- Low-bandwidth streaming -> opus_48000_64 (smaller files)
+-- Audio processing pipeline -> pcm_16000 or pcm_44100 (raw audio)
+-- Telephony / IVR -> ulaw_8000 or alaw_8000 (legacy codecs)
+-- High-quality archival -> wav_44100 or mp3_44100_192
```
</decision_framework>
<red_flags>
RED FLAGS
High Priority Issues:
- Hardcoding API keys instead of using `process.env.ELEVENLABS_API_KEY` (security breach risk)
- Not consuming streams returned by `convert()` or `stream()` (resources leaked, audio lost)
- Passing `voiceId` inside the options object instead of as the first positional argument (type error)
- Using deprecated `eleven_turbo_v2_5` instead of `eleven_flash_v2_5` (migrate to Flash models)
- Using deprecated `eleven_monolingual_v1` or `eleven_multilingual_v1` (use v2+ models)
Medium Priority Issues:
- Not setting `timeoutInSeconds` for production (default is 240 seconds -- may be too long or too short)
- Using `stability: 0.0` or extreme voice settings without testing (produces artifacts)
- Not using `optimizeStreamingLatency` when streaming to users (adds unnecessary latency)
- Ignoring `outputFormat` and relying on the default when a specific format is needed
- Creating voice clones with a single short sample (multiple 30s+ samples improve quality)
Common Mistakes:
- Confusing `@elevenlabs/elevenlabs-js` (server-side TTS SDK) with `@elevenlabs/client` (conversational AI agents SDK) -- they serve different purposes
- Using `textToSpeech.convert()` for real-time playback instead of `textToSpeech.stream()` -- convert waits for full generation
- Sending text longer than the model's character limit (10K for multilingual_v2, 40K for flash_v2_5, 5K for v3) -- the request will fail
- Not using `previous_request_ids` for multi-part audio -- causes voice inconsistency between segments
- Using `eleven_v3` when latency matters -- it has higher latency than Flash models
Gotchas & Edge Cases:
- The SDK auto-retries on 408, 409, 429, and 5xx errors -- 2 retries by default. Set `maxRetries: 0` if you handle retries yourself.
- `convert()` and `stream()` both return `ReadableStream<Uint8Array>`, but `stream()` starts sending data before generation completes (lower time-to-first-byte).
- `convertWithTimestamps()` returns `{ audioBase64, alignment }`, NOT a stream -- the entire audio is base64-encoded.
- `streamWithTimestamps()` returns an SSE `Stream<ChunkWithTimestamps>` -- each chunk has audio data AND character timing.
- Voice settings are optional -- if omitted, the voice's default settings are used. Override per-request for fine-tuning.
- The `play()` helper function from the SDK requires MPV and FFmpeg installed locally -- not suitable for production servers.
- WebSocket input streaming text must end with a space character for proper buffering (see the sketch after this list).
- WebSocket `chunk_length_schedule` defaults to `[120, 160, 250, 290]` characters -- audio generation starts after the first threshold.
- Pronunciation dictionaries are limited to 3 per request and must be provided in the first WebSocket message.
- `enable_ssml_parsing` must be set as a query parameter on the WebSocket connection, not in the text message.
- The `speed` voice setting accepts values roughly in the 0.7-1.3 range for natural-sounding output.
- Free tier has 2-4 concurrent request limits -- higher tiers get elevated concurrency.
</red_flags>
<critical_reminders>
CRITICAL REMINDERS
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST use `@elevenlabs/elevenlabs-js` for server-side TTS, voice management, and speech-to-speech -- use `@elevenlabs/client` only for conversational AI agents)
(You MUST never hardcode API keys -- always use environment variables via `process.env.ELEVENLABS_API_KEY`, which the SDK reads automatically)
(You MUST consume the `ReadableStream<Uint8Array>` returned by `convert()` and `stream()` -- unconsumed streams leak resources)
(You MUST choose the correct model for your use case -- `eleven_v3` for quality, `eleven_flash_v2_5` for speed, `eleven_multilingual_v2` for long-form stability)
(You MUST pass `voiceId` as the first positional argument to all `textToSpeech` methods -- it is NOT inside the options object)
Failure to follow these rules will produce broken, insecure, or degraded voice AI integrations.
</critical_reminders>