EasyPlatform ai-multimodal
[AI & Tools] Process and generate multimedia content using Google Gemini API -- vision analysis, audio transcription, video processing, document extraction, image/video generation. Triggers on multimodal, vision API, image recognition, audio transcription, video analysis, gemini, imagen, document extraction.
```bash
# Clone the full repo:
git clone https://github.com/duc01226/EasyPlatform

# Or install just this skill:
T=$(mktemp -d) && git clone --depth=1 https://github.com/duc01226/EasyPlatform "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/.claude/skills/ai-multimodal" ~/.claude/skills/duc01226-easyplatform-ai-multimodal \
  && rm -rf "$T"
```
.claude/skills/ai-multimodal/SKILL.md

<!-- SYNC:critical-thinking-mindset -->
[IMPORTANT] Use `TaskCreate` to break ALL work into small tasks BEFORE starting — including tasks for each file read. This prevents context loss from long files. For simple tasks, AI MUST ask the user whether to skip.
<!-- /SYNC:critical-thinking-mindset --> <!-- SYNC:ai-mistake-prevention -->Critical Thinking Mindset — Apply critical thinking and sequential thinking. Every claim needs traced proof and confidence >80% to act. Anti-hallucination: never present a guess as fact — cite sources for every claim, admit uncertainty freely, self-check output for errors, cross-reference independently, and stay skeptical of your own confidence — certainty without evidence is the root of all hallucination.
<!-- /SYNC:ai-mistake-prevention -->AI Mistake Prevention — Failure modes to avoid on every task:
- Check downstream references before deleting. Deleting components causes documentation and code staleness cascades. Map all referencing files before removal.
- Verify AI-generated content against actual code. AI hallucinates APIs, class names, and method signatures. Always grep to confirm existence before documenting or referencing.
- Trace full dependency chain after edits. Changing a definition misses downstream variables and consumers derived from it. Always trace the full chain.
- Trace ALL code paths when verifying correctness. Confirming code exists is not confirming it executes. Always trace early exits, error branches, and conditional skips — not just happy path.
- When debugging, ask "whose responsibility?" before fixing. Trace whether bug is in caller (wrong data) or callee (wrong handling). Fix at responsible layer — never patch symptom site.
- Assume existing values are intentional — ask WHY before changing. Before changing any constant, limit, flag, or pattern: read comments, check git blame, examine surrounding code.
- Verify ALL affected outputs, not just the first. Changes touching multiple stacks require verifying EVERY output. One green check is not all green checks.
- Holistic-first debugging — resist nearest-attention trap. When investigating any failure, list EVERY precondition first (config, env vars, DB names, endpoints, DI registrations, data preconditions), then verify each against evidence before forming any code-layer hypothesis.
- Surgical changes — apply the diff test. Bug fix: every changed line must trace directly to the bug. Don't restyle or improve adjacent code. Enhancement task: implement improvements AND announce them explicitly.
- Surface ambiguity before coding — don't pick silently. If request has multiple interpretations, present each with effort estimate and ask. Never assume all-records, file-based, or more complex path.
Quick Summary
Goal: Process and generate multimedia content (images, audio, video, documents) using Google Gemini API via Python scripts.
Workflow:
- Identify Modality — Match input type to task (analyze, transcribe, extract, generate)
- Check Limits — Inline max 20MB, File API max 2GB; split large audio at 15min chunks
- Execute — Run
with appropriate task and filesgemini_batch_process.py - Post-Process — Format output as markdown with timestamps, save generated content
Key Rules:
- Requires the `GEMINI_API_KEY` environment variable
- Always request specific nodes/files; avoid full-file downloads
- Use `media_optimizer.py` to compress/split files exceeding limits
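The Identify Modality step above can be sketched as a simple extension lookup. The mapping and the `default_task` helper are hypothetical illustrations of the dispatch idea, not the actual logic inside `gemini_batch_process.py`:

```python
from pathlib import Path

# Hypothetical extension-to-task mapping; the real script may resolve
# tasks differently — this only illustrates the dispatch step.
DEFAULT_TASKS = {
    ".png": "analyze", ".jpg": "analyze", ".jpeg": "analyze", ".webp": "analyze",
    ".wav": "transcribe", ".mp3": "transcribe", ".aac": "transcribe",
    ".mp4": "analyze", ".mov": "analyze",
    ".pdf": "extract",
}

def default_task(path: str) -> str:
    """Pick a default Gemini task from the file extension."""
    ext = Path(path).suffix.lower()
    if ext not in DEFAULT_TASKS:
        raise ValueError(f"unsupported media type: {ext}")
    return DEFAULT_TASKS[ext]
```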
Be skeptical. Apply critical thinking and sequential thinking. Every claim needs traced proof and a stated confidence (more than 80% before acting).
AI Multimodal
Purpose
Process audio, images, videos, and documents or generate images/videos using Google Gemini's multimodal API via bundled Python scripts.
When to Use
- Analyzing images or screenshots (Gemini vision is preferred over Claude's built-in vision for complex tasks)
- Transcribing audio files (meetings, podcasts, interviews)
- Extracting data from PDFs, scanned documents, or charts
- Processing video content (scene detection, temporal Q&A)
- Generating images with Imagen 4 or videos with Veo 3
- Converting documents to markdown with visual understanding
When NOT to Use
- Simple text-only LLM calls -- use Claude directly
- Reading a file Claude can already read (code, markdown, JSON) -- use the `Read` tool
- Building AI-powered application features -- use `api-design` or `frontend-design`
- Music composition workflows -- load `references/music-generation.md` only when specifically requested
- General prompt engineering -- use the `ai-artist` skill
Prerequisites
```bash
export GEMINI_API_KEY="your-key"   # From https://aistudio.google.com/apikey
pip install google-genai python-dotenv pillow
python scripts/check_setup.py      # Verify setup
```
Optional: API key rotation for rate limits (set `GEMINI_API_KEY_2`, `GEMINI_API_KEY_3`).
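The rotation keys can be collected with a small helper like the sketch below; the function name and the stop-at-first-gap behavior are assumptions based on the variable naming above, not the skill's actual resolution logic:

```python
import itertools
import os

def available_api_keys() -> list[str]:
    """Collect GEMINI_API_KEY plus numbered fallbacks (GEMINI_API_KEY_2,
    GEMINI_API_KEY_3, ...) in order, stopping at the first unset variable."""
    keys = []
    for i in itertools.count(1):
        name = "GEMINI_API_KEY" if i == 1 else f"GEMINI_API_KEY_{i}"
        value = os.environ.get(name)
        if not value:
            break
        keys.append(value)
    return keys
```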
Workflow
Step 1: Identify Modality
| Input Type | Task | Command |
|---|---|---|
| Image (PNG/JPG/WEBP) | Analyze, caption, OCR | `--task analyze` |
| Audio (WAV/MP3/AAC) | Transcribe, summarize | `--task transcribe` |
| Video (MP4/MOV) | Scene detection, Q&A | `--task analyze` |
| PDF/Document | Extract tables, forms | `--task extract` |
| Text prompt | Generate image | `--task generate` |
| Text prompt | Generate video | `--task generate-video` |
Step 2: Check Limits
- Inline upload: max 20MB
- File API: max 2GB (auto-used for large files)
- Audio transcription: split at 15-minute chunks for full transcript
- Video transcription: extract audio first, then split and transcribe
- Formats: Audio (WAV/MP3/AAC, up to 9.5h), Images (PNG/JPEG/WEBP, up to 3,600 per request), Video (MP4/MOV, up to 6h), PDF (up to 1k pages)
If a file exceeds these limits, use `scripts/media_optimizer.py` to compress/split first.
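The limit check boils down to a size-threshold decision. The constants mirror the 20MB inline and 2GB File API limits stated above; the function name and return values are illustrative, not the scripts' actual API:

```python
INLINE_MAX = 20 * 1024 * 1024          # 20 MB inline upload limit
FILE_API_MAX = 2 * 1024 * 1024 * 1024  # 2 GB File API limit

def upload_strategy(size_bytes: int) -> str:
    """Decide how a file should reach Gemini based on its size."""
    if size_bytes <= INLINE_MAX:
        return "inline"
    if size_bytes <= FILE_API_MAX:
        return "file_api"
    # Too large even for the File API: compress/split with media_optimizer.py
    return "optimize_first"
```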
Step 3: Execute
Quick check: if the `gemini` CLI is available, use:

```bash
echo "<prompt>" | gemini -y -m gemini-2.5-flash
```
Standard: Use the batch processing script:
```bash
# Analyze media
python scripts/gemini_batch_process.py --files <file> --task <analyze|transcribe|extract>

# Generate content
python scripts/gemini_batch_process.py --task generate --prompt "description"
python scripts/gemini_batch_process.py --task generate-video --prompt "description"
```
Stdin support:
```bash
cat image.png | python scripts/gemini_batch_process.py --task analyze --prompt "Describe this"
```
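File-vs-stdin input handling of this kind typically looks like the sketch below; the `read_media_bytes` helper is hypothetical and not the script's actual code:

```python
import sys

def read_media_bytes(files: list[str]) -> bytes:
    """Return the bytes of the first file argument, or fall back to
    reading raw bytes from stdin when no files are given (piped input)."""
    if files:
        with open(files[0], "rb") as f:
            return f.read()
    return sys.stdin.buffer.read()
```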
Step 4: Post-Processing
- For transcripts: output in markdown with `[HH:MM:SS -> HH:MM:SS]` timestamps
- For document extraction: save as structured markdown under `docs/assets/`
- For generated images/videos: save to the working directory with a descriptive filename
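Timestamp markers in the `[HH:MM:SS -> HH:MM:SS]` format can be produced with a couple of small helpers (function names are illustrative):

```python
def hms(seconds: float) -> str:
    """Format a second count as zero-padded HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}"

def span(start: float, end: float) -> str:
    """Render a transcript segment marker like [00:00:00 -> 00:12:34]."""
    return f"[{hms(start)} -> {hms(end)}]"
```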
Step 5: Verification
- Confirm output matches expected format and completeness
- For long transcripts: verify no truncation occurred (check chunk boundaries)
- For generated content: verify quality meets prompt requirements
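The chunk-boundary check for long transcripts can be automated: given the (start, end) spans of each transcribed chunk, verify they line up with no gaps. A sketch, with a hypothetical tolerance parameter:

```python
def chunks_contiguous(spans: list[tuple[float, float]], tol: float = 1.0) -> bool:
    """True if each chunk starts within `tol` seconds of the previous
    chunk's end, i.e. no transcript was lost at a split boundary."""
    return all(
        abs(nxt[0] - prev[1]) <= tol
        for prev, nxt in zip(spans, spans[1:])
    )
```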
Models
| Purpose | Model | Notes |
|---|---|---|
| Analysis (fast) | gemini-2.5-flash | Recommended default |
| Analysis (advanced) | | Complex reasoning tasks |
| Image generation | Imagen 4 | Standard quality |
| Image generation (quality) | | Best quality |
| Image generation (speed) | | Fastest |
| Video generation | Veo 3 | 8s clips with audio |
Scripts Reference
- `gemini_batch_process.py` -- CLI orchestrator for all tasks; auto-resolves API keys and models
- `media_optimizer.py` -- compress/resize/split media to fit Gemini limits
- `document_converter.py` -- convert PDFs/images/Office docs to markdown
- `check_setup.py` -- verify environment, dependencies, and API key
Use `--help` on any script for full options.
Examples
Example 1: Transcribe a Meeting Recording
Input: 45-minute meeting audio file `meeting-2025-01-15.mp3`
Steps:
- File is >15 min, so split first: `python scripts/media_optimizer.py --input meeting-2025-01-15.mp3 --split-duration 900`
- Transcribe each chunk: `python scripts/gemini_batch_process.py --files meeting-part-*.mp3 --task transcribe`
- Output: markdown file with timestamps, speaker detection, and metadata (duration, topics covered)
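As a sanity check on the split: a 45-minute file at 900-second (15-minute) chunks yields three parts. The arithmetic, as a tiny helper (name is illustrative):

```python
import math

def chunk_count(duration_s: float, chunk_s: int = 900) -> int:
    """How many pieces splitting at chunk_s seconds produces."""
    return math.ceil(duration_s / chunk_s)
```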
Example 2: Extract Data from a PDF Report
Input: Quarterly HR report PDF with tables, charts, and forms
Steps:
- Convert and extract: `python scripts/document_converter.py --input quarterly-report.pdf --output docs/assets/`
- Output: structured markdown with tables preserved, chart descriptions, and form field values extracted
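Extracted table cells can be re-rendered as structured markdown with a helper like this (hypothetical; `document_converter.py` may format tables differently):

```python
def to_markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render extracted table cells as a pipe-delimited markdown table."""
    out = ["| " + " | ".join(headers) + " |",
           "|" + "|".join("---" for _ in headers) + "|"]
    out += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(out)
```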
Detailed References
Load for in-depth guidance:
| Topic | File |
|---|---|
| Audio processing | |
| Vision/image analysis | |
| Image generation | |
| Video analysis | |
| Video generation | |
| Music generation | `references/music-generation.md` |
Related Skills
- `ai-artist` -- for prompt engineering and optimization (not media processing)
- `media-processing` -- for FFmpeg-based audio/video encoding without AI
- `pdf-to-markdown` -- for simple PDF text extraction without vision AI
Closing Reminders
- MANDATORY IMPORTANT MUST ATTENTION break work into small todo tasks using `TaskCreate` BEFORE starting
- MANDATORY IMPORTANT MUST ATTENTION search codebase for 3+ similar patterns before creating new code
- MANDATORY IMPORTANT MUST ATTENTION cite `file:line` evidence for every claim (confidence >80% to act)
- MANDATORY IMPORTANT MUST ATTENTION add a final review todo task to verify work quality <!-- SYNC:critical-thinking-mindset:reminder -->
- MUST ATTENTION apply critical thinking — every claim needs traced proof, confidence >80% to act. Anti-hallucination: never present guess as fact. <!-- /SYNC:critical-thinking-mindset:reminder --> <!-- SYNC:ai-mistake-prevention:reminder -->
- MUST ATTENTION apply AI mistake prevention — holistic-first debugging, fix at responsible layer, surface ambiguity before coding, re-read files after compaction. <!-- /SYNC:ai-mistake-prevention:reminder -->