Awesome-omni-skill media-generation
Generate images, videos, and audio using Google's Gemini APIs. Use for image generation/editing (Gemini 3 Pro Image), video generation (Veo 3), and speech (TBD). Trigger words - images: generate, create, draw, design, make, edit, modify image/picture. Video: generate video, create video, animate, make a video. Supports text-to-image, image-to-image editing, text-to-video, and image-to-video.
install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/content-media/media-generation" ~/.claude/skills/diegosouzapw-awesome-omni-skill-media-generation-15a0a4 && rm -rf "$T"
manifest: skills/content-media/media-generation/SKILL.md
Media Generation
Image Generation
    uv run ~/.claude/skills/media-generation/scripts/generate_image.py \
      --prompt "description or editing instructions" \
      --filename "output.png" \
      [--input-image "source.png"] \
      [--resolution 1K|2K|4K]
Resolution
- 1K (default): also for "low res", "1080p"
- 2K: also for "medium", "2048"
- 4K: also for "high res", "hi-res", "ultra"
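The alias list above can be expressed as a small lookup. This is a minimal sketch, not code from the skill: the name `normalize_resolution` and the fallback to 1K for unrecognized input are assumptions.

```python
from typing import Optional

# Aliases taken from the Resolution list; keys are valid --resolution values.
RESOLUTION_ALIASES = {
    "1K": ("1k", "low res", "1080p"),
    "2K": ("2k", "medium", "2048"),
    "4K": ("4k", "high res", "hi-res", "ultra"),
}


def normalize_resolution(request: Optional[str]) -> str:
    """Map an informal phrase to "1K", "2K", or "4K" (hypothetical helper)."""
    if not request:
        return "1K"  # documented default
    text = request.strip().lower()
    for resolution, aliases in RESOLUTION_ALIASES.items():
        if text in aliases:
            return resolution
    return "1K"  # assumption: unknown phrases fall back to the default
```

The result can be passed straight to `--resolution`.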
Video Generation
    uv run ~/.claude/skills/media-generation/scripts/generate_video.py \
      --prompt "video description" \
      --filename "output.mp4" \
      [--model veo-3.0-generate-preview] \
      [--negative "things to avoid"] \
      [--input-image "first-frame.png"]
Models
- veo-3.0-generate-001 (default): stable, video only
- veo-3.0-fast-generate-001: faster, lower cost
- veo-3.1-generate-preview: supports video extend, audio sync
- veo-3.1-fast-generate-preview: fast with extend support
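One way to read the model list is as a two-way choice: whether the extend feature is needed, and whether speed and cost matter more than quality. The sketch below encodes that reading. The function `pick_model` and its parameters are hypothetical and not part of the skill's scripts.

```python
def pick_model(fast: bool = False, needs_extend: bool = False) -> str:
    """Choose a Veo model name from the list above (hypothetical helper)."""
    if needs_extend:
        # Only the veo-3.1 previews are listed as supporting extend.
        return "veo-3.1-fast-generate-preview" if fast else "veo-3.1-generate-preview"
    # Otherwise stay on the stable veo-3.0 models.
    return "veo-3.0-fast-generate-001" if fast else "veo-3.0-generate-001"
```

The returned string is intended as the value for `--model`.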
Prompting Tips
- Specify camera movements: "slow zoom in", "pan left", "close-up"
- Add "no talking, no dialogue" if character shouldn't speak
- Describe atmosphere: "rain outside", "purple mystical energy"
Note: Veo requires a paid tier. Pricing is roughly $0.40/sec (standard) and $0.15/sec (fast).
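The prompting tips can be applied mechanically by assembling a prompt from its parts. This is an illustrative sketch only: `build_video_prompt` is a hypothetical helper, and joining the parts with commas is an assumption about prompt style, not a Veo requirement.

```python
from typing import Optional


def build_video_prompt(
    subject: str,
    camera: Optional[str] = None,
    atmosphere: Optional[str] = None,
    silent: bool = False,
) -> str:
    """Combine a scene description with the optional hints from the tips."""
    parts = [subject.strip()]
    if camera:
        parts.append(camera)  # e.g. "slow zoom in"
    if atmosphere:
        parts.append(atmosphere)  # e.g. "rain outside"
    if silent:
        parts.append("no talking, no dialogue")
    return ", ".join(parts)
```

The resulting string would be passed as `--prompt` to `generate_video.py`.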
Music Video from Image + Audio
Overview
- Start with character image + audio track (e.g., from Suno)
- Transcribe audio to get timestamps
- Generate clip 1 from image (veo-3.1)
- Extend each subsequent clip from previous (maintains continuity)
- Stitch clips + overlay audio with ffmpeg
Step 1: Transcribe audio for timing
    whisper-ctranslate2 "song.mp3" --model large-v3 --output_dir /tmp --output_format srt
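To use the transcript for timing, the SRT output has to be turned into start and end times in seconds. The parser below is a minimal stdlib sketch; `parse_srt` is a hypothetical helper, and it assumes the standard SRT cue layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then text).

```python
import re
from typing import List, Tuple

# Matches "00:00:01,500 --> 00:00:04,000" (comma or dot before milliseconds).
SRT_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(text: str) -> List[Tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, text) for each cue in an SRT string."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = SRT_TIME.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                cues.append((start, end, " ".join(lines[i + 1:]).strip()))
                break
    return cues
```

The cue boundaries can then guide where each roughly 8-second clip should begin and what its prompt should describe.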
Step 2: Generate first clip from image
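    import time
    from google import genai
    from google.genai import types

    client = genai.Client()  # assumes the SDK picks up the API key from the environment
    # Use veo-3.1 (required for extend feature); img_data is the source image's bytes
    operation = client.models.generate_videos(
        model="veo-3.1-generate-preview",
        image=types.Image(image_bytes=img_data, mime_type="image/jpeg"),
        prompt="character description, scene action, no talking",
    )
    while not operation.done:  # generation is asynchronous; poll until finished
        time.sleep(10)
        operation = client.operations.get(operation)
    video1 = operation.result.generated_videos[0]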
Step 3: Extend from previous clip
    operation = client.models.generate_videos(
        model="veo-3.1-generate-preview",
        video=previous_video.video,  # Pass previous video object
        prompt="next scene description, continuous action, no talking",
    )
Step 4: Stitch clips + add audio
    # Create concat list
    printf "file 'clip_01.mp4'\nfile 'clip_02.mp4'\n..." > concat.txt

    # Stitch video clips
    ffmpeg -f concat -safe 0 -i concat.txt -c copy combined.mp4

    # Add audio track
    ffmpeg -i combined.mp4 -i song.mp3 -c:v copy -c:a aac -map 0:v -map 1:a final.mp4
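With around 30 clips, writing the concat list by hand is error-prone. The sketch below generates the same `file '...'` lines from a list of names. `build_concat_list` is a hypothetical helper; rejecting names that contain a single quote is a simplification made here to avoid quoting problems.

```python
from typing import Iterable


def build_concat_list(clip_names: Iterable[str]) -> str:
    """Return the contents of concat.txt for ffmpeg's concat demuxer."""
    lines = []
    for name in clip_names:
        if "'" in name:
            # Simplification: avoid quoting edge cases rather than escape them.
            raise ValueError(f"single quote not supported in clip name: {name!r}")
        lines.append(f"file '{name}'\n")
    return "".join(lines)


# Example: names following the clip_01.mp4, clip_02.mp4 pattern used above.
clip_names = [f"clip_{i:02d}.mp4" for i in range(1, 4)]
concat_text = build_concat_list(clip_names)
```

Writing `concat_text` to `concat.txt` gives the input expected by the `ffmpeg -f concat` command.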
Cost estimate
- ~8 sec per clip × $0.40/sec = $3.20/clip
- 4-min song ≈ 30 clips ≈ $96
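The arithmetic above generalizes to other song lengths and to the fast tier. The following is a rough sketch using the approximate per-second rates quoted in this document. `estimate_cost` is a hypothetical helper, and it assumes every clip is billed at the full clip length.

```python
import math
from typing import Tuple

# Approximate rates from the note in the Video Generation section.
RATE_PER_SECOND = {"standard": 0.40, "fast": 0.15}


def estimate_cost(
    song_seconds: float, clip_seconds: float = 8.0, tier: str = "standard"
) -> Tuple[int, float]:
    """Return (number_of_clips, estimated_cost_in_dollars)."""
    clips = math.ceil(song_seconds / clip_seconds)
    cost = clips * clip_seconds * RATE_PER_SECOND[tier]
    return clips, round(cost, 2)
```

For a 4-minute song, `estimate_cost(240)` reproduces the 30 clips and $96 figure above.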
Audio Generation
- Music: Use Suno (external service)
- Speech: Gemini 2.5 TTS (Flash or Pro) - TBD script
API Key
Uses the GEMINI_API_KEY env var, or pass --api-key KEY.
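A script could resolve the key along these lines. This is a sketch under assumptions, not the skill's actual implementation: `resolve_api_key` is hypothetical, and letting `--api-key` take precedence over the environment variable is a guess at the intended behavior.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def resolve_api_key(
    argv: Optional[Sequence[str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the key from --api-key if given, else from GEMINI_API_KEY."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--api-key", default=None)
    args, _unknown = parser.parse_known_args(argv)  # ignore unrelated flags
    env = os.environ if env is None else env
    key = args.api_key or env.get("GEMINI_API_KEY")
    if not key:
        raise SystemExit("Set GEMINI_API_KEY or pass --api-key KEY")
    return key
```

Passing `argv` and `env` explicitly keeps the function easy to test; with the defaults it reads the real command line and environment.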