Learn-skills.dev ai-multimodal
Process and generate multimedia content using the Google Gemini API. Capabilities: analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis, up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), and generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
git clone https://github.com/NeverSight/learn-skills.dev
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/aia-11-hn-mib/mib-mockinterviewaibot/ai-multimodal" ~/.claude/skills/neversight-learn-skills-dev-ai-multimodal && rm -rf "$T"
data/skills-md/aia-11-hn-mib/mib-mockinterviewaibot/ai-multimodal/SKILL.md
AI Multimodal Processing Skill
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
Core Capabilities
Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice
Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction
Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis
Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)
Image Generation
- Text-to-image generation
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality
Capability Matrix
| Task | Audio | Image | Video | Document | Generation |
|---|---|---|---|---|---|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |
Model Selection Guide
Gemini 2.5 Series (Recommended)
- gemini-2.5-pro: Highest quality, all features, 1M-2M context
- gemini-2.5-flash: Best balance, all features, 1M-2M context
- gemini-2.5-flash-lite: Lightweight, segmentation support
- gemini-2.5-flash-image: Image generation only
Feature Requirements
- Segmentation: Requires 2.5+ models
- Object Detection: Requires 2.0+ models
- Multi-video: Requires 2.5+ models
- Image Generation: Requires flash-image model
Context Windows
- 2M tokens: ~6 hours video (low-res) or ~2 hours (default)
- 1M tokens: ~3 hours video (low-res) or ~1 hour (default)
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size
Quick Start
Prerequisites
API Key Setup: Supports both Google AI Studio and Vertex AI.
The skill checks for GEMINI_API_KEY in this order:
- Process environment: export GEMINI_API_KEY="your-key"
- Project root: .env
- .claude/.env
- .claude/skills/.env
- .claude/skills/ai-multimodal/.env
Get API key: https://aistudio.google.com/apikey
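The lookup order above can be sketched with the standard library alone. This is an illustration, not the skill's actual implementation: `find_api_key` and `read_env_value` are hypothetical names, and the parser only handles plain KEY=VALUE lines.

```python
import os
from pathlib import Path

# Candidate .env locations relative to the project root, in lookup order.
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_env_value(path, name):
    """Return the value of `name` from a simple KEY=VALUE file, or None."""
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("\"'")
    return None


def find_api_key(project_root=".", environ=None, name="GEMINI_API_KEY"):
    """Check the process environment first, then each .env file in order."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    for relative in ENV_FILES:
        value = read_env_value(Path(project_root) / relative, name)
        if value:
            return value
    return None
```

The first match wins, so a key exported in the shell overrides any .env file.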
For Vertex AI:
    export GEMINI_USE_VERTEX=true
    export VERTEX_PROJECT_ID=your-gcp-project-id
    export VERTEX_LOCATION=us-central1  # Optional
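One way these variables could be turned into client arguments is sketched below. `client_kwargs` is a hypothetical helper, and the fallback to us-central1 when VERTEX_LOCATION is unset is an assumption based on the "Optional" note. The returned keys match the `vertexai`, `project`, `location`, and `api_key` parameters of `genai.Client` in the google-genai SDK.

```python
import os


def client_kwargs(environ=None):
    """Build keyword arguments for genai.Client from environment variables."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_USE_VERTEX", "").lower() == "true":
        return {
            "vertexai": True,
            "project": environ["VERTEX_PROJECT_ID"],
            # Assumed default; the variable is marked optional above.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    return {"api_key": environ["GEMINI_API_KEY"]}


# Usage (requires the google-genai package):
#   from google import genai
#   client = genai.Client(**client_kwargs())
```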
Install SDK:
pip install google-genai python-dotenv pillow
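With the SDK installed, a direct call might look like the sketch below. It is not taken from the skill's scripts: `transcribe` is a hypothetical helper and the prompt text is illustrative. The client is passed in as an argument so the function can be exercised without network access.

```python
def transcribe(client, path, model="gemini-2.5-flash"):
    """Upload an audio file and ask the model for a timestamped transcript."""
    # Upload through the File API, then reference the upload in the request.
    uploaded = client.files.upload(file=path)
    response = client.models.generate_content(
        model=model,
        contents=[uploaded, "Transcribe this audio with timestamps."],
    )
    return response.text


# Usage (requires the google-genai package and a configured API key):
#   from google import genai
#   client = genai.Client()
#   print(transcribe(client, "audio.mp3"))
```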
Common Patterns
Transcribe Audio:
    python scripts/gemini_batch_process.py \
      --files audio.mp3 \
      --task transcribe \
      --model gemini-2.5-flash
Analyze Image:
    python scripts/gemini_batch_process.py \
      --files image.jpg \
      --task analyze \
      --prompt "Describe this image" \
      --output docs/assets/<output-name>.md \
      --model gemini-2.5-flash
Process Video:
    python scripts/gemini_batch_process.py \
      --files video.mp4 \
      --task analyze \
      --prompt "Summarize key points with timestamps" \
      --output docs/assets/<output-name>.md \
      --model gemini-2.5-flash
Extract from PDF:
    python scripts/gemini_batch_process.py \
      --files document.pdf \
      --task extract \
      --prompt "Extract table data as JSON" \
      --output docs/assets/<output-name>.md \
      --format json
Generate Image:
    python scripts/gemini_batch_process.py \
      --task generate \
      --prompt "A futuristic city at sunset" \
      --output docs/assets/<output-file-name> \
      --model gemini-2.5-flash-image \
      --aspect-ratio 16:9
Optimize Media:
    # Prepare large video for processing
    python scripts/media_optimizer.py \
      --input large-video.mp4 \
      --output docs/assets/<output-file-name> \
      --target-size 100MB

    # Batch optimize multiple files
    python scripts/media_optimizer.py \
      --input-dir ./videos \
      --output-dir docs/assets/optimized \
      --quality 85
Convert Documents to Markdown:
    # Convert a document to Markdown
    python scripts/document_converter.py \
      --input document.docx \
      --output docs/assets/document.md

    # Extract pages
    python scripts/document_converter.py \
      --input large.pdf \
      --output docs/assets/chapter1.md \
      --pages 1-20
Supported Formats
Audio
- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
- Max 9.5 hours per request
- Auto-downsampled to 16 Kbps mono
Images
- PNG, JPEG, WEBP, HEIC, HEIF
- Max 3,600 images per request
- Resolution: ≤384px = 258 tokens, larger = tiled
Video
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Max 6 hours (low-res) or 2 hours (default)
- YouTube URLs supported (public only)
Documents
- PDF only for vision processing
- Max 1,000 pages
- TXT, HTML, Markdown supported (text-only)
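A pre-flight check against the format lists above could look like this sketch. The extension-to-format mapping is an assumption (for example, `.3gp` for 3GPP and `.ogg` for OGG Vorbis), and `detect_modality` is a hypothetical helper.

```python
from pathlib import Path

# File extensions assumed to correspond to the formats listed above.
EXTENSIONS = {
    "audio": {".wav", ".mp3", ".aac", ".flac", ".ogg", ".aiff"},
    "image": {".png", ".jpg", ".jpeg", ".webp", ".heic", ".heif"},
    "video": {".mp4", ".mpeg", ".mov", ".avi", ".flv", ".mpg", ".webm", ".wmv", ".3gp"},
    "document": {".pdf", ".txt", ".html", ".md"},
}


def detect_modality(path):
    """Return the modality for a file name, or None if it is unsupported."""
    suffix = Path(path).suffix.lower()
    for modality, suffixes in EXTENSIONS.items():
        if suffix in suffixes:
            return modality
    return None
```

Returning None for unknown extensions lets the caller reject a file, or convert it first, before spending a request on it.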
Size Limits
- Inline: <20MB total request
- File API: 2GB per file, 20GB project quota
- Retention: 48 hours auto-delete
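The size limits suggest a simple routing rule, sketched here. `choose_upload_method` is a hypothetical helper, and reading MB and GB as binary units is an assumption. The inline limit applies to the whole request, so a real check would also leave headroom for the prompt.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # <20MB total request
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # 2GB per file


def choose_upload_method(size_bytes):
    """Pick inline data or the File API based on payload size."""
    if size_bytes < INLINE_LIMIT_BYTES:
        return "inline"
    if size_bytes <= FILE_API_LIMIT_BYTES:
        return "file_api"
    raise ValueError("file exceeds the 2GB File API limit; split or compress it first")
```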
Reference Navigation
For detailed implementation guidance, see:
Audio Processing
- references/audio-processing.md: Transcription, analysis, TTS
- Timestamp handling and segment analysis
- Multi-speaker identification
- Non-speech audio analysis
- Text-to-speech generation
Image Understanding
- references/vision-understanding.md: Captioning, detection, OCR
- Object detection and localization
- Pixel-level segmentation
- Visual question answering
- Multi-image comparison
Video Analysis
- references/video-analysis.md: Scene detection, temporal understanding
- YouTube URL processing
- Timestamp-based queries
- Video clipping and FPS control
- Long video optimization
Document Extraction
- references/document-extraction.md: PDF processing, structured output
- Table and form extraction
- Chart and diagram analysis
- JSON schema validation
- Multi-page handling
Image Generation
- references/image-generation.md: Text-to-image, editing
- Prompt engineering strategies
- Image editing and composition
- Aspect ratio selection
- Safety settings
Cost Optimization
Token Costs
Model Pricing (input and output):
- Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
- Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
- Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output
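A per-request cost estimate follows directly from these rates. The sketch below copies the prices from the list above as plain data; they are the document's figures, not verified current pricing, and `estimate_cost` is a hypothetical helper.

```python
# USD per 1M tokens, copied from the list above. Replace with current rates.
PRICING = {
    "gemini-2.5-flash": {"input": 1.00, "output": 0.10},
    "gemini-2.5-pro": {"input": 3.00, "output": 12.00},
    "gemini-1.5-flash": {"input": 0.70, "output": 0.175},
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated cost in USD for one request."""
    rates = PRICING[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000
```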
Token Rates:
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- Video: ~300 tokens/second (default) or ~100 (low-res)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size
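These rates make it possible to estimate input size before sending a request. The sketch below is approximate: image tokens are left out because they depend on tiling, and `estimate_tokens` and `fits_context` are hypothetical helpers.

```python
# Token rates from the list above; the video figures are approximate.
AUDIO_TOKENS_PER_SECOND = 32
VIDEO_TOKENS_PER_SECOND = {"default": 300, "low": 100}
PDF_TOKENS_PER_PAGE = 258


def estimate_tokens(audio_seconds=0, video_seconds=0, pdf_pages=0,
                    video_resolution="default"):
    """Return a rough input-token estimate for a mixed-media request."""
    return (
        audio_seconds * AUDIO_TOKENS_PER_SECOND
        + video_seconds * VIDEO_TOKENS_PER_SECOND[video_resolution]
        + pdf_pages * PDF_TOKENS_PER_PAGE
    )


def fits_context(tokens, context_window=1_000_000):
    """Check an estimate against a context window (1M tokens by default)."""
    return tokens <= context_window
```

At the default video rate, one hour comes to 3,600 × 300 = 1,080,000 tokens, slightly over a 1M window, which is why long videos are usually processed at low resolution or in segments.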
TTS Pricing:
- Flash TTS: $10/1M tokens
- Pro TTS: $20/1M tokens
Best Practices
- Use gemini-2.5-flash for most tasks (best price/performance)
- Use File API for files >20MB or repeated queries
- Optimize media before upload (see media_optimizer.py)
- Process specific segments instead of full videos
- Use lower FPS for static content
- Implement context caching for repeated queries
- Batch process multiple files in parallel
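Parallel batch processing can be sketched with a thread pool, as below. This is not how gemini_batch_process.py is known to work: `process_batch` is a hypothetical helper, and `process_one` stands in for whatever function sends a single file to the API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(paths, process_one, max_workers=4):
    """Run process_one on each path concurrently; collect results and errors."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going when one file fails
                errors[path] = exc
    return results, errors
```

Threads suit this workload because each call spends most of its time waiting on the network. Keep `max_workers` low enough to stay within the requests-per-minute limit.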
Rate Limits
Free Tier:
- 10-15 RPM (requests per minute)
- 1M-4M TPM (tokens per minute)
- 1,500 RPD (requests per day)
YouTube Limits:
- Free tier: 8 hours/day
- Paid tier: No length limits
- Public videos only
Storage Limits:
- 20GB per project
- 2GB per file
- 48-hour retention
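To stay under a requests-per-minute limit, a client-side sliding-window limiter is one option. The sketch below defaults to 10 calls per 60 seconds, the lower end of the free-tier range. `RateLimiter` is a hypothetical class, and the clock and sleep functions are injectable so it can be tested without waiting.

```python
import time
from collections import deque


class RateLimiter:
    """Block until a call is allowed under max_calls per period (seconds)."""

    def __init__(self, max_calls=10, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._calls and now - self._calls[0] >= self._period:
                self._calls.popleft()
            if len(self._calls) < self._max_calls:
                self._calls.append(now)
                return
            # Sleep until the oldest call falls out of the window.
            self._sleep(self._period - (now - self._calls[0]))
```

Call `limiter.wait()` immediately before each API request.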
Error Handling
Common errors and solutions:
- 400: Invalid format/size - validate before upload
- 401: Invalid API key - check configuration
- 403: Permission denied - verify API key restrictions
- 404: File not found - ensure file uploaded and active
- 429: Rate limit exceeded - implement exponential backoff
- 500: Server error - retry with backoff
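The backoff advice for 429 and 500 can be sketched as a small retry wrapper. `ApiError` is a hypothetical stand-in for whatever exception your client raises with a status code, and the delay schedule (doubling from one second, plus jitter) is an assumption, not a documented requirement.

```python
import random
import time

RETRYABLE_STATUS = {429, 500}


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status, message=""):
        super().__init__(message or f"HTTP {status}")
        self.status = status


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying retryable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ApiError as exc:
            last_attempt = attempt == max_attempts - 1
            if exc.status not in RETRYABLE_STATUS or last_attempt:
                raise
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            sleep(delay)
```

Errors such as 400, 401, 403, and 404 are raised immediately, since retrying cannot fix a bad request or an invalid key.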
Scripts Overview
All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
- Supports all modalities (audio, image, video, PDF)
- Progress tracking and error recovery
- Output formats: JSON, Markdown, CSV
- Rate limiting and retry logic
- Dry-run mode
media_optimizer.py: Prepare media for Gemini API
- Compress videos/audio for size limits
- Resize images appropriately
- Split long videos into chunks
- Format conversion
- Quality vs size optimization
document_converter.py: Convert documents to PDF
- Convert DOCX, XLSX, PPTX to PDF
- Extract page ranges
- Optimize PDFs for Gemini
- Extract images from PDFs
- Batch conversion support
Run any script with --help for detailed usage.