Awesome-omni-skill media-generation

Generate images, videos, and audio using Google's Gemini APIs. Use for image generation/editing (Gemini 3 Pro Image), video generation (Veo 3), and speech (TBD). Trigger words - images: generate, create, draw, design, make, edit, modify image/picture. Video: generate video, create video, animate, make a video. Supports text-to-image, image-to-image editing, text-to-video, and image-to-video.

install

source · Clone the upstream repo

git clone https://github.com/diegosouzapw/awesome-omni-skill

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/content-media/media-generation" ~/.claude/skills/diegosouzapw-awesome-omni-skill-media-generation-15a0a4 && rm -rf "$T"

manifest: skills/content-media/media-generation/SKILL.md

source content

Media Generation

Image Generation

uv run ~/.claude/skills/media-generation/scripts/generate_image.py \
  --prompt "description or editing instructions" \
  --filename "output.png" \
  [--input-image "source.png"] \
  [--resolution 1K|2K|4K]

Resolution

```
1K
```
(default) — also for: "low res", "1080p"
```
2K
```
— also for: "medium", "2048"
```
4K
```
— also for: "high res", "hi-res", "ultra"

Video Generation

uv run ~/.claude/skills/media-generation/scripts/generate_video.py \
  --prompt "video description" \
  --filename "output.mp4" \
  [--model veo-3.0-generate-preview] \
  [--negative "things to avoid"] \
  [--input-image "first-frame.png"]

Models

```
veo-3.0-generate-001
```
(default) — stable, video only
```
veo-3.0-fast-generate-001
```
— faster, lower cost
```
veo-3.1-generate-preview
```
— supports video extend, audio sync
```
veo-3.1-fast-generate-preview
```
— fast with extend support

Prompting Tips

Specify camera movements:
```
"slow zoom in", "pan left", "close-up"
```
Add
```
"no talking, no dialogue"
```
if character shouldn't speak

Describe atmosphere:

"rain outside", "purple mystical energy"

Note: Veo requires paid tier. ~$0.40/sec standard, ~$0.15/sec fast.

Music Video from Image + Audio

Overview

Start with character image + audio track (e.g., from Suno)
Transcribe audio to get timestamps
Generate clip 1 from image (veo-3.1)
Extend each subsequent clip from previous (maintains continuity)
Stitch clips + overlay audio with ffmpeg

Step 1: Transcribe audio for timing

whisper-ctranslate2 "song.mp3" --model large-v3 --output_dir /tmp --output_format srt

Step 2: Generate first clip from image

# Use veo-3.1 (required for extend feature)
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    image=types.Image(image_bytes=img_data, mime_type="image/jpeg"),
    prompt="character description, scene action, no talking",
)
video1 = operation.result.generated_videos[0]

Step 3: Extend from previous clip

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    video=previous_video.video,  # Pass previous video object
    prompt="next scene description, continuous action, no talking",
)

Step 4: Stitch clips + add audio

# Create concat list
printf "file 'clip_01.mp4'\nfile 'clip_02.mp4'\n..." > concat.txt

# Stitch video clips
ffmpeg -f concat -safe 0 -i concat.txt -c copy combined.mp4

# Add audio track
ffmpeg -i combined.mp4 -i song.mp3 -c:v copy -c:a aac -map 0:v -map 1:a final.mp4

Cost estimate

~8 sec per clip × $0.40/sec = $3.20/clip
4-min song ≈ 30 clips ≈ $96

Audio Generation

Music: Use Suno (external service)
Speech: Gemini 2.5 TTS (Flash or Pro) - TBD script

API Key

Uses

GEMINI_API_KEY

env var, or pass

--api-key KEY