Skills ai-media

ai-media - AI Media Generation

install

source · Clone the upstream repo

git clone https://github.com/openclaw/skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bowen31337/ai-media" ~/.claude/skills/clawdbot-skills-ai-media && rm -rf "$T"

manifest: skills/bowen31337/ai-media/SKILL.md

source content

ai-media - AI Media Generation

Full-stack AI media generation powered by GPU server (RTX 3090/3080/2070S).

Capabilities

Image Generation — Photorealistic images via ComfyUI (z-image, Juggernaut XL)
Video Generation — Video synthesis via ComfyUI (AnimateDiff, LTX-2)
Talking Heads — Animated talking faces via SadTalker
Voice Synthesis — Natural TTS via Voxtral (whisper.cpp)

GPU Server

Host:
```
${GPU_USER}@${GPU_HOST}
```
SSH Key:
```
~/.ssh/id_ed25519_gpu
```
ComfyUI:
```
/data/ai-stack/comfyui/ComfyUI/
```
(port 8188)
SadTalker:
```
/data/ai-stack/sadtalker/
```
Voxtral:
```
/data/ai-stack/whisper/
```
Output:
```
/data/ai-stack/output/
```

Usage

Generate Image

./scripts/image.sh "lady on beach at sunset" realistic
./scripts/image.sh "cyberpunk cityscape" artistic

Arguments:

```
$1
```
: Prompt text
```
$2
```
: Style (realistic|artistic) — optional, default: realistic

Output: Path to generated image (e.g.,

/data/ai-stack/output/image_001.png

)

Generate Video

./scripts/video.sh "waves crashing on shore" animatediff 4
./scripts/video.sh "city traffic timelapse" ltx2 8

Arguments:

```
$1
```
: Prompt text
```
$2
```
: Model (animatediff|ltx2) — optional, default: animatediff
```
$3
```
: Duration in seconds — optional, default: 4

Output: Path to generated video (e.g.,

/data/ai-stack/output/video_001.mp4

)

Generate Talking Head

./scripts/talking-head.sh "Hello, I'm Agent" gentle input.jpg
./scripts/talking-head.sh "Welcome to the future" neutral photo.png

Arguments:

```
$1
```
: Speech text
```
$2
```
: Voice style (gentle|neutral|energetic) — optional, default: gentle
```
$3
```
: Avatar image path — optional, generates default if not provided

Output: Path to talking head video (e.g.,

/data/ai-stack/output/talking_001.mp4

)

Generate Audio

./scripts/audio.sh "This is a test message" en male
./scripts/audio.sh "Bonjour le monde" fr female

Arguments:

```
$1
```
: Text to speak
```
$2
```
: Language code (en|fr|es|etc) — optional, default: en
```
$3
```
: Voice gender (male|female) — optional, default: male

Output: Path to audio file (e.g.,

/data/ai-stack/output/audio_001.wav

)

Models Available

Image Models

z-image — 6B params, S3-DiT, photorealistic (downloading, 43% complete)
Juggernaut XL v9 — SDXL-based, versatile (7.1GB, ready)

Video Models

AnimateDiff — SD 1.5 motion module (512x512, working ✅)
LTX-2 — 19B params, high quality (14GB checkpoint ready, Gemma encoder ready)

Talking Head Models

SadTalker — Audio-driven head animation (working ✅)

Voice Models

Voxtral — whisper.cpp-based TTS (installed)

Dependencies

All dependencies are pre-installed on GPU server:

ComfyUI with custom nodes (AnimateDiff-Evolved, VideoHelperSuite)
SadTalker with face enhancer
Voxtral with whisper.cpp
FFmpeg for video encoding

Error Handling

Scripts will:

Check SSH connectivity before execution
Validate GPU server is running
Return meaningful error messages
Clean up failed generations automatically

Performance

Image: ~10-20s for 1024x1024
Video (AnimateDiff): ~20-30s for 512x512, 16 frames
Video (LTX-2): ~60-90s for 768x512, 4s @ 24fps
Talking Head: ~30-40s for 10s video
Audio: ~2-5s for 30s speech

Future Enhancements

Batch generation support
Style transfer capabilities
Video upscaling (spatial + temporal)
Multi-language voice cloning
Real-time preview streaming

Status: Active development Maintainer: Agent GPU Server: ${GPU_USER}@${GPU_HOST}