Claude-skill-registry gemini-audio
Guide for implementing Google Gemini API audio capabilities - analyze audio with transcription, summarization, and understanding (up to 9.5 hours), plus generate speech with controllable TTS. Use when processing audio files, creating transcripts, analyzing speech/music/sounds, or generating natural speech from text.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/gemini-audio" ~/.claude/skills/majiayu000-claude-skill-registry-gemini-audio-1e59df && rm -rf "$T"
skills/data/gemini-audio/SKILL.mdGemini Audio API Skill
Process audio with transcription, analysis, and understanding, plus generate natural speech using Google's Gemini API. Supports up to 9.5 hours of audio per request with multiple formats.
When to Use This Skill
Use this skill when you need to:
- Transcribe audio files to text with timestamps
- Summarize audio content and extract key points
- Analyze speech, music, or environmental sounds
- Generate speech from text with controllable voice and style
- Process podcasts, interviews, meetings, or any audio content
- Understand non-speech audio (birdsong, sirens, music)
Prerequisites
API Key Setup
The skill automatically detects your
GEMINI_API_KEY in this order:
- Process environment:
export GEMINI_API_KEY="your-key" - Skill directory:
.claude/skills/gemini-audio/.env - Project directory:
(project root)./.env
Get your API key: Visit Google AI Studio
Create
.env file with:
GEMINI_API_KEY=your_api_key_here
Python Setup
Install required package:
pip install google-genai
Quick Start
Audio Analysis (Transcription, Summarization)
from google import genai import os # API key auto-detected from environment client = genai.Client(api_key=os.getenv('GEMINI_API_KEY')) # Upload audio file myfile = client.files.upload(file='podcast.mp3') # Transcribe response = client.models.generate_content( model='gemini-2.5-flash', contents=['Generate a transcript of the speech.', myfile] ) print(response.text) # Summarize response = client.models.generate_content( model='gemini-2.5-flash', contents=['Summarize the key points in 5 bullets.', myfile] ) print(response.text)
Using Helper Scripts
# Transcribe audio python .claude/skills/gemini-audio/scripts/transcribe.py audio.mp3 # Summarize audio python .claude/skills/gemini-audio/scripts/analyze.py audio.mp3 \ "Summarize key points" # Analyze specific segment (timestamps in MM:SS format) python .claude/skills/gemini-audio/scripts/analyze.py audio.mp3 \ "What is discussed from 02:30 to 05:15?" # Generate speech python .claude/skills/gemini-audio/scripts/generate-speech.py \ "Welcome to our podcast" \ --output welcome.wav
Audio Understanding Capabilities
Supported Formats
| Format | MIME Type | Best Use |
|---|---|---|
| WAV | | Uncompressed, highest quality |
| MP3 | | Compressed, widely compatible |
| AAC | | Compressed, good quality |
| FLAC | | Lossless compression |
| OGG Vorbis | | Open format |
| AIFF | | Apple format |
Audio Specifications
- Maximum length: 9.5 hours per request
- Multiple files: Unlimited count, combined max 9.5 hours
- Token rate: 32 tokens/second (1 minute = 1,920 tokens)
- Processing: Auto-downsampled to 16 Kbps mono
- File size limits:
- Inline: 20 MB max total request
- File API: 2 GB per file, 20 GB project quota
- Retention: 48 hours auto-delete
Analysis Features
- Transcription: Full text with punctuation
- Timestamps: Reference segments (MM:SS format)
- Multi-speaker: Identify different speakers
- Non-speech: Analyze music, sounds, ambient audio
- Languages: Support for multiple languages
Speech Generation (TTS)
Available TTS Models
| Model | Quality | Speed | Cost/1M tokens |
|---|---|---|---|
| High | Fast | $10 |
TTS mode | Premium | Slower | $20 |
Controllable Voice Options
- Style: Professional, casual, narrative, conversational
- Pace: Slow, normal, fast
- Tone: Friendly, serious, enthusiastic
- Accent: Natural language control
TTS Example
response = client.models.generate_content( model='gemini-2.5-flash-native-audio-preview-09-2025', contents='Generate audio: Welcome to today\'s episode, in a warm, friendly tone.' ) # Save audio output with open('output.wav', 'wb') as f: f.write(response.audio_data)
Input Methods
Method 1: File Upload (Recommended for >20MB)
# Upload and reuse myfile = client.files.upload(file='large-audio.mp3') # Use file multiple times response1 = client.models.generate_content( model='gemini-2.5-flash', contents=['Transcribe this', myfile] ) response2 = client.models.generate_content( model='gemini-2.5-flash', contents=['Summarize this', myfile] )
Method 2: Inline Data (<20MB)
from google.genai import types with open('small-audio.mp3', 'rb') as f: audio_bytes = f.read() response = client.models.generate_content( model='gemini-2.5-flash', contents=[ 'Describe this audio', types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3') ] )
Common Use Cases
Transcription
python scripts/transcribe.py meeting.mp3 --include-timestamps
Summary with Key Points
python scripts/analyze.py interview.wav "Extract main topics and key quotes"
Speaker Identification
python scripts/analyze.py discussion.mp3 "Identify speakers and extract dialogue"
Segment Analysis
python scripts/analyze.py podcast.mp3 "Summarize content from 10:30 to 15:45"
Non-Speech Analysis
python scripts/analyze.py ambient.wav "Identify all sounds: voices, music, ambient"
Best Practices
File Management
- Use File API for files >20MB or repeated usage
- Files auto-delete after 48 hours
- Manage quota (20 GB project limit)
Prompt Engineering
- Be specific: "Transcribe from 02:30 to 03:29"
- Use timestamps for segment analysis (MM:SS format)
- Combine tasks: "Transcribe and summarize"
- Provide context: "This is a medical interview"
Cost Optimization
- Use
($1/1M tokens) for most tasksgemini-2.5-flash - Upgrade to
($3/1M tokens) for complex analysisgemini-2.5-pro - Check token count: 1 min audio = 1,920 tokens
Error Handling
- Validate file format and size before upload
- Implement exponential backoff for rate limits
- Handle 48-hour file expiration
Token Costs & Pricing
Audio Input (32 tokens/second):
- 1 minute = 1,920 tokens
- 1 hour = 115,200 tokens
- 9.5 hours = 1,094,400 tokens
Model Pricing:
- Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
- Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
- Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output
TTS Pricing:
- Flash TTS: $10/1M tokens
- Pro TTS: $20/1M tokens
Reference Documentation
For detailed information, see:
- Complete API specificationsreferences/api-reference.md
- Comprehensive code examplesreferences/code-examples.md
- Text-to-speech implementation guidereferences/tts-guide.md
- Advanced optimization strategiesreferences/best-practices.md
Scripts Overview
All scripts support 3-step API key detection:
- transcribe.py: Generate transcripts with optional timestamps
- analyze.py: General audio analysis with custom prompts
- generate-speech.py: Text-to-speech generation
- manage-files.py: Upload, list, and delete audio files
Run any script with
--help for detailed usage.