Awesome-openclaw-skills audio-gen
Generate audiobooks, podcasts, or educational audio content on demand. User provides an idea or topic, Claude AI writes a script, and ElevenLabs converts it to high-quality audio. Supports multiple formats (audiobook, podcast, educational), custom lengths, and voice effects. Use when asked to create audio content, make a podcast, generate an audiobook, or produce educational audio. Returns MP3 audio file via MEDIA token.
git clone https://github.com/sundial-org/awesome-openclaw-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/sundial-org/awesome-openclaw-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/audio-gen" ~/.claude/skills/sundial-org-awesome-openclaw-skills-audio-gen && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/sundial-org/awesome-openclaw-skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/audio-gen" ~/.openclaw/skills/sundial-org-awesome-openclaw-skills-audio-gen && rm -rf "$T"
skills/audio-gen/SKILL.md
🎙️ Audio Content Generator
Generate high-quality audiobooks, podcasts, or educational audio content on demand using AI-written scripts and ElevenLabs text-to-speech.
Quick Start
Create an audiobook chapter:
User: "Create a 5-minute audiobook chapter about a dragon discovering friendship"
Generate a podcast:
User: "Make a 10-minute podcast about the history of coffee"
Produce educational content:
User: "Generate a 15-minute educational audio explaining how neural networks work"
Content Formats
Audiobook
Style: Narrative storytelling with emotional depth
- Clear beginning, middle, and end
- Descriptive language and vivid imagery
- Dramatic pacing with thoughtful pauses
- Emotional tone that matches the story
- Use voice effects like [whispers], [excited], [serious] for impact
Example Structure:
[Opening hook - set the scene]
[long pause]
[Story development with character emotions]
[short pause] between sentences
[long pause] between paragraphs
[Climax with dramatic tension]
[long pause]
[Resolution and emotional closure]
Podcast
Style: Conversational and engaging
- Warm, welcoming intro (15-30 seconds)
- Main content with natural flow
- Transitions between topics
- Memorable outro with key takeaways
- Conversational tone throughout
Example Structure:
**Intro:** "Welcome to [topic]. I'm excited to share..."
[short pause]
**Main Content:** "Let's start with... [topic 1]"
[long pause] between segments
**Outro:** "Thanks for listening! Remember..."
Educational Content
Style: Clear explanations for learning
- Simple introductions to complex topics
- Step-by-step breakdowns
- Real-world examples and analogies
- Recap of key concepts at the end
- Enthusiastic delivery with [excited] for important points
Example Structure:
**Introduction:** What is [topic] and why it matters?
**Main Content:**
- Concept 1: Explanation + Example
- Concept 2: Explanation + Example
- Concept 3: Explanation + Example
**Summary:** Key takeaways and next steps
Length Guidelines
Word Count to Duration Conversion:
- 5 minutes = ~375 words
- 10 minutes = ~750 words
- 15 minutes = ~1,125 words
- 20 minutes = ~1,500 words
- 30 minutes = ~2,250 words
Pacing: All word counts in this skill assume ~75 words per minute
Practical Limits:
- Minimum: 2 minutes (~150 words)
- Maximum: 30 minutes (~2,250 words)
- Sweet spot: 5-15 minutes for best engagement
Workflow Instructions
Step 1: Understand the Request
Parse the user's request for:
- Content type (audiobook, podcast, educational, or inferred from topic)
- Topic/theme (what should the content be about)
- Target length (how many minutes)
- Tone/style (dramatic, casual, educational, etc.)
- Special requests (specific voice, emphasis on certain points)
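As a rough illustration of this parsing step, the sketch below pulls a content type and a target length out of a request with a regular expression and a keyword lookup. The function name `parse_request`, the keyword map, and the 5-minute default are assumptions made for the example and are not part of the skill; in practice the agent interprets the request itself.

```python
import re

# Hypothetical keyword map; the skill itself does not define one.
FORMAT_KEYWORDS = {
    "audiobook": "audiobook",
    "podcast": "podcast",
    "educational": "educational",
    "lesson": "educational",
}


def parse_request(text, default_minutes=5):
    """Return (content_type, minutes) guessed from a free-form request."""
    lowered = text.lower()
    # Matches "5-minute", "10 minute", "15 minutes", "5 min".
    match = re.search(r"(\d+)[\s-]*min", lowered)
    minutes = int(match.group(1)) if match else default_minutes
    content_type = next(
        (fmt for word, fmt in FORMAT_KEYWORDS.items() if word in lowered),
        None,  # None means "infer the format from the topic"
    )
    return content_type, minutes
```

Tone and special requests (such as a named voice) are harder to capture with patterns and are left out of this sketch.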
Step 2: Calculate Word Count
target_words = target_minutes × 75
Example: 10 minutes = 10 × 75 = 750 words
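The same calculation can be sketched in Python, with the practical limits from the length guidelines applied as a range check. The function name `target_words` and the choice to raise an error for out-of-range lengths are assumptions for illustration.

```python
WORDS_PER_MINUTE = 75   # pacing figure used throughout this skill
MIN_MINUTES = 2         # practical minimum from the length guidelines
MAX_MINUTES = 30        # practical maximum per audio file


def target_words(target_minutes):
    """Return the script word budget for a requested duration in minutes."""
    if not MIN_MINUTES <= target_minutes <= MAX_MINUTES:
        raise ValueError(
            f"length must be between {MIN_MINUTES} and {MAX_MINUTES} minutes"
        )
    return round(target_minutes * WORDS_PER_MINUTE)
```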
Step 3: Generate the Script
Write the complete script following these rules:
Content Guidelines:
- Start strong with an engaging hook
- Maintain natural, conversational flow
- Use active voice and simple sentence structure
- Include relevant examples and stories
- End with a satisfying conclusion
Formatting Rules:
- Add [short pause] after sentences (use sparingly, not every sentence)
- Add [long pause] between paragraphs or major sections
- Use voice effects strategically: [whispers], [shouts], [excited], [serious], [sarcastic], [sings], [laughs]
- Write numbers as words: "twenty-three" not "23"
- Spell out acronyms first time: "AI, or artificial intelligence"
- Avoid complex punctuation (em-dashes work, but semicolons don't read well)
- Remove markdown formatting before TTS conversion
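Some of these rules can be checked mechanically before the script is shown to the user. The sketch below is one possible lint pass; the function name `lint_script` and the warning wording are assumptions, and it only covers digits, semicolons, and a few markdown markers.

```python
import re


def lint_script(script):
    """Return a list of warnings for text that may not read well aloud."""
    warnings = []
    if re.search(r"\d", script):
        warnings.append("digits found: write numbers as words")
    if ";" in script:
        warnings.append("semicolon found: split the sentence instead")
    # Bold markers, double underscores, or a markdown header at line start.
    if re.search(r"(\*\*|__|^#{1,6}\s)", script, flags=re.MULTILINE):
        warnings.append("markdown found: remove before TTS conversion")
    return warnings
```

An empty list means none of the checked patterns were found; it does not guarantee that the script follows every rule above.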
Step 4: Present the Script
Show the script to the user and ask:
Here's the [format] script I've created (approximately [length] minutes):
[Display the script]
Would you like me to:
1. Generate the audio now
2. Make changes to the script
3. Adjust the length or tone
Step 5: Handle User Feedback
If user requests changes:
- Regenerate the script with adjustments
- Maintain the target word count
- Present the revised version
If user approves:
- Proceed to audio generation
Step 6: Generate Audio
Format the script for TTS:
- Remove any remaining markdown (headers, bold, italics)
- Ensure voice effects are in proper [effect] format
- Check that pauses are appropriately placed
- Verify numbers and acronyms are spelled out
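A minimal sketch of the markdown cleanup, assuming the script only contains headers, bold, and italics as listed above. The function name `strip_markdown` is hypothetical, and the patterns deliberately leave bracketed tags such as [short pause] untouched.

```python
import re


def strip_markdown(script):
    """Remove common markdown markers while leaving [effect] tags intact."""
    text = re.sub(r"^#{1,6}\s*", "", script, flags=re.MULTILINE)  # headers
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)                  # bold
    text = re.sub(r"\*(.+?)\*", r"\1", text)                      # italics with asterisks
    text = re.sub(r"(?<!\w)_(.+?)_(?!\w)", r"\1", text)           # italics with underscores
    return text.strip()
```

Links, lists, and code spans are not handled here; a script that contains them would need additional patterns.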
Invoke the TTS script:
IMPORTANT: The ELEVENLABS_API_KEY environment variable is already configured in the system. Simply invoke the TTS script directly.
uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
  -o /tmp/audio-gen-[timestamp]-[topic-slug].mp3 \
  -m eleven_multilingual_v2 \
  "[formatted_script]"
For long scripts, use heredoc:
uv run /home/clawdbot/clawdbot/skills/sag/scripts/tts.py \
  -o /tmp/audio-gen-[timestamp]-[topic-slug].mp3 \
  -m eleven_multilingual_v2 \
  "$(cat <<'EOF'
[formatted_script]
EOF
)"
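If the invocation is driven from Python rather than a shell, the command can be assembled as an argument list. The sketch below assumes the script path and the `-o` / `-m` flags shown in the command above; the helper names `slugify` and `build_tts_command` are hypothetical.

```python
import re
import time

# Path copied from the command above; adjust for your install.
TTS_SCRIPT = "/home/clawdbot/clawdbot/skills/sag/scripts/tts.py"


def slugify(topic):
    """Lowercase the topic and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", topic.lower()))


def build_tts_command(script_text, topic, model="eleven_multilingual_v2", timestamp=None):
    """Return (argv, output_path) for the TTS invocation."""
    stamp = int(time.time()) if timestamp is None else timestamp
    output_path = f"/tmp/audio-gen-{stamp}-{slugify(topic)}.mp3"
    argv = ["uv", "run", TTS_SCRIPT, "-o", output_path, "-m", model, script_text]
    return argv, output_path
```

Passing `argv` to `subprocess.run(argv, check=True)` hands the script text over as a single argument without shell quoting, so long scripts need no heredoc in that setting.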
Return the result:
MEDIA:/tmp/audio-gen-[timestamp]-[topic-slug].mp3
Your [format] is ready! [Brief description of content]. Duration: approximately [X] minutes.
Voice Effects (Bracket Tags)
Available voice modulation effects (use sparingly for impact):
- [whispers]: Soft, intimate delivery
- [shouts]: Loud, emphatic delivery
- [excited]: Enthusiastic, energetic tone
- [serious]: Grave, solemn tone
- [sarcastic]: Ironic, mocking tone
- [sings]: Musical, melodic delivery
- [laughs]: Amused, jovial tone
- [short pause]: Brief silence (~0.5s)
- [long pause]: Extended silence (~1-2s)
Best Practices:
- Use effects for emotional moments, not every sentence
- Pauses are your most powerful tool for pacing
- Voice effects work best in audiobooks and dramatic content
- Keep podcasts and educational content mostly natural
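Because template placeholders such as [topic] use the same bracket syntax as the effects, it can help to scan a finished script for bracketed text that is not one of the tags listed in this section. The sketch below is one way to do that; the function name `unknown_tags` is hypothetical.

```python
import re

# Tags listed in this section.
KNOWN_TAGS = {
    "whispers", "shouts", "excited", "serious", "sarcastic",
    "sings", "laughs", "short pause", "long pause",
}


def unknown_tags(script):
    """Return bracketed tags in the script that are not in KNOWN_TAGS."""
    found = re.findall(r"\[([^\[\]]+)\]", script)
    return sorted({tag for tag in found if tag.strip().lower() not in KNOWN_TAGS})
```

Anything this returns is either a leftover placeholder or a misspelled effect and is worth fixing before audio generation.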
Error Handling
Script Too Long
If the generated script exceeds target by >20%:
The script I generated is [X] words ([Y] minutes), which is longer than your target of [Z] minutes.
Would you like me to:
1. Condense it to fit the target length
2. Split it into multiple parts
3. Keep it as is
Script Too Short
If the generated script is under target by >20%:
The script is [X] words ([Y] minutes), shorter than your target.
Would you like me to:
1. Expand it with more detail
2. Add additional examples or stories
3. Generate as is
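Both length checks use the same 20% tolerance around the target, which can be sketched as a single function. The name `check_length` and its return labels are assumptions; note that this simple whitespace count also counts the words inside bracketed tags.

```python
WORDS_PER_MINUTE = 75  # pacing figure used by this skill
TOLERANCE = 0.20       # allowed deviation from the target


def check_length(script, target_minutes):
    """Classify a script as 'too_long', 'too_short', or 'ok' against its target."""
    target = target_minutes * WORDS_PER_MINUTE
    words = len(script.split())  # whitespace-separated tokens, tags included
    if words > target * (1 + TOLERANCE):
        return "too_long"
    if words < target * (1 - TOLERANCE):
        return "too_short"
    return "ok"
```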
TTS Generation Fails
If the TTS script fails:
I've created the script, but I'm unable to generate the audio right now. Here's your script:
[Display script]
Error: [specific error message]
You can:
1. Check that ELEVENLABS_API_KEY is configured
2. Use the script with your own text-to-speech tool
3. Try again in a moment
4. Ask me to troubleshoot the audio generation
Common TTS Issues:
- API key not set: Verify ELEVENLABS_API_KEY in config
- Rate limit: Wait a moment and try again
- Text too long: Break into smaller chunks (max ~5000 characters)
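One way to break a long script into smaller requests is to split on sentence boundaries and pack sentences into chunks up to the limit. The sketch below assumes the ~5000-character figure above; the function name `chunk_script` is hypothetical, and how the resulting audio files are joined afterwards is outside its scope.

```python
import re

MAX_CHARS = 5000  # approximate per-request limit mentioned above


def chunk_script(script, max_chars=MAX_CHARS):
    """Pack sentences into chunks of up to max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting after sentence-ending punctuation keeps each request ending on a natural break rather than in mid-sentence.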
Invalid Request
For unrealistic requests (e.g., "100-hour audiobook"):
That length would require [X] words and take significant time to generate. I recommend:
- Breaking it into multiple episodes/chapters
- Targeting 5-30 minutes per audio file
- Creating a series instead of one long file
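The figures for such a reply follow from the pacing and the 30-minute cap. A small sketch, with the hypothetical function name `plan_parts`:

```python
import math

WORDS_PER_MINUTE = 75  # pacing figure used by this skill
MAX_MINUTES = 30       # maximum length per audio file


def plan_parts(total_minutes, max_minutes=MAX_MINUTES):
    """Return (total_words, number_of_parts) for a long request."""
    total_words = total_minutes * WORDS_PER_MINUTE
    parts = math.ceil(total_minutes / max_minutes)
    return total_words, parts
```

For the 100-hour example, `plan_parts(100 * 60)` gives 450,000 words spread across 200 files of 30 minutes each.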
Tips for Best Results
For Engaging Audiobooks
- Focus on character emotions and sensory details
- Use pauses to build dramatic tension
- Vary sentence length for rhythm
- Include internal monologue and reflection
For Compelling Podcasts
- Start with a question or surprising fact
- Use conversational phrases: "You know what's interesting..."
- Include relatable examples from everyday life
- End with actionable takeaways
For Effective Educational Content
- Use the "explain like I'm five" approach
- Build from simple to complex concepts
- Repeat key terms and definitions
- Provide multiple examples for clarity
Technical Notes
TTS Implementation:
- Uses Python script: ~/.clawdbot/clawdbot/skills/sag/scripts/tts.py
- No binary installation required (pure Python + requests)
- Directly calls ElevenLabs API
- Compatible with Linux and macOS
File Storage:
- Audio files are saved to /tmp/
- Filename format: audio-gen-[timestamp]-[topic-slug].mp3
- Files are automatically cleaned up after 24 hours
API Requirements:
- Anthropic API for script generation (already configured)
- ElevenLabs API for text-to-speech (configured via ELEVENLABS_API_KEY)
- Both services must be configured and have available credits
Supported Models:
- eleven_multilingual_v2: Best quality (default)
- eleven_turbo_v2: Faster generation
- eleven_turbo_v2_5: Fastest generation
- eleven_multilingual_v1: Legacy model
Cost Estimate:
- 10-minute audio (~750 words): approximately $1.43
- Claude API: ~$0.075
- ElevenLabs: ~$1.35
- Longer content scales proportionally
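Proportional scaling from the 10-minute figures can be sketched as follows. The constants are the estimates quoted above, not current pricing, and the function name `estimate_cost` is hypothetical.

```python
# Per-10-minute estimates quoted above; actual pricing depends on your plans.
CLAUDE_COST_PER_10_MIN = 0.075
ELEVENLABS_COST_PER_10_MIN = 1.35


def estimate_cost(minutes):
    """Return an approximate cost in dollars, scaling linearly with length."""
    per_minute = (CLAUDE_COST_PER_10_MIN + ELEVENLABS_COST_PER_10_MIN) / 10
    return per_minute * minutes
```

The result is left unrounded; format it for display with something like `f"${estimate_cost(30):.2f}"`.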
Generation Time:
- Script generation: 5-30 seconds (depending on length)
- Audio generation: 5-15 seconds (ElevenLabs processing)
- Total: Usually under 1 minute for 10-minute audio
Limitations
- Maximum Length: 30 minutes (~2,250 words) per audio file
  - For longer content, create multiple parts/episodes
- Single Voice: Currently supports one narrator voice
  - Cannot do multi-voice dialogue or character voices
- No Background Music: Pure voice narration only
  - No background music, sound effects, or audio mixing
- Real-time Generation: Each request generates fresh content
  - No pre-made templates or cached audio
- Language: Primarily English
  - ElevenLabs supports other languages, but content generation is optimized for English
Example Conversations
Example 1: Quick Audiobook
User: Create a 5-minute audiobook chapter about a lonely robot learning to feel emotions
Agent: I'll create a 5-minute sci-fi audiobook chapter. Let me write the story...
[Generates ~400-word narrative with emotional arc]
Here's the story I've created about Unit-7, a maintenance robot who discovers the meaning of friendship through an unexpected encounter. Would you like me to generate the audio or make changes first?
User: Generate the audio
Agent: [Invokes SAG with formatted script]
MEDIA:/tmp/audio-gen-1234567890-lonely-robot.mp3
Your audiobook chapter is ready! The story follows Unit-7's emotional awakening through a touching encounter with a young engineer. Duration: approximately 5 minutes.
Example 2: Podcast with Revisions
User: Make a 10-minute podcast about sustainable living tips
Agent: [Generates podcast script with intro, 5 practical tips, outro]
Here's a 10-minute podcast with 5 actionable sustainable living tips. The tone is encouraging and practical. Would you like me to:
1. Generate audio now
2. Make it more urgent/climate-focused
3. Add more specific examples
4. Adjust the length
User: Add more specific examples for each tip
Agent: [Regenerates with detailed examples]
Better? I've added specific examples like calculating your carbon footprint, composting at home, and choosing sustainable brands. Ready for audio?
User: Perfect!
Agent: MEDIA:/tmp/audio-gen-1234567890-sustainable-living.mp3
Your sustainable living podcast is ready with 5 practical tips, each with concrete examples you can start using today! Duration: 10 minutes.
Example 3: Educational Content
User: Create a 15-minute educational audio explaining quantum computing for beginners
Agent: I'll create a beginner-friendly 15-minute explanation of quantum computing...
[Generates ~1,100 words with analogies, examples, and clear structure]
Script preview:
- Introduction: What is quantum computing? (2 min)
- Classical vs Quantum: The key differences (4 min)
- Qubits and Superposition: How it works (5 min)
- Real Applications: What it means for you (3 min)
- Conclusion: The future of computing (1 min)
Should I generate the audio?
User: Yes please
Agent: MEDIA:/tmp/audio-gen-1234567890-quantum-computing.mp3
Your educational audio on quantum computing is ready! It uses everyday analogies to explain qubits, superposition, and real-world applications. Duration: approximately 15 minutes.
When to Use This Skill
Use the audio-gen skill when users request:
- "Create an audiobook about..."
- "Make a podcast episode about..."
- "Generate audio content explaining..."
- "Record a story about..."
- "Produce an audio guide for..."
- "Create a narrated version of..."
- "Make an audio lesson about..."
Do NOT use for:
- Transcribing existing audio (use whisper skills instead)
- Converting user-written text to audio (use SAG skill directly)
- Real-time voice interaction (use voice-call plugin)
- Multi-speaker dialogues or interviews (current limitation)
Advanced Usage
Voice Selection
Users can request specific voices:
User: Create a podcast with the Rachel voice
Agent: [Checks voice configuration for "Rachel" voice ID and uses it in SAG command]
Series/Episodes
For multi-part content:
User: Create a 3-episode series about space exploration
Agent: I'll create Episode 1 first. Each episode will be 10 minutes. Let's start with "The Dawn of the Space Age"...
[After completion]
Episode 1 is ready! Would you like me to continue with Episode 2?
Format Blending
Mix formats for unique styles:
User: Create an educational podcast that tells a story
Agent: [Generates content that combines storytelling narrative with educational explanations]
Troubleshooting
Issue: Audio sounds robotic or unnatural
Solution: Add more pauses and voice effects. Use contractions and conversational language.
Issue: Script doesn't match requested length
Solution: Regenerate with explicit word count target. Check calculations (75 words/min).
Issue: Content is too technical or too simple
Solution: Ask user for target audience. Adjust complexity accordingly.
Issue: SAG command fails
Solution: Check ELEVENLABS_API_KEY is set. Verify SAG skill is installed and working.
Issue: User wants to edit the script manually
Solution: Provide the plain text script. User can modify it and paste back for audio generation.
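For the failing-command case, a quick preflight check can narrow down the cause before retrying. The sketch below only covers two conditions, the ELEVENLABS_API_KEY variable and the presence of `uv` on the PATH; the function name `preflight` is hypothetical, and it does not verify that the key is valid or that the SAG skill is installed.

```python
import os
import shutil


def preflight(env=None):
    """Return a list of problems that would prevent the TTS command from running."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("ELEVENLABS_API_KEY"):
        problems.append("ELEVENLABS_API_KEY is not set")
    if shutil.which("uv") is None:
        problems.append("uv is not on PATH")
    return problems
```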
💡 Pro Tip: Always generate the script first and get user approval before creating audio. This saves time and API costs, and ensures the user gets exactly what they want.