OpenMontage grok-media
xAI Grok image and video generation guide covering authentication, endpoints, prompt structure, image editing, reference-image video, and async polling.
install
source · Clone the upstream repo
git clone https://github.com/calesthio/OpenMontage
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/calesthio/OpenMontage "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/grok-media" ~/.claude/skills/calesthio-openmontage-grok-media && rm -rf "$T"
manifest: .agents/skills/grok-media/SKILL.md
Grok Media
Use this skill when working with xAI media models in OpenMontage.
Models
- grok-imagine-image: for image generation and image editing
- grok-imagine-video: for text-to-video, image-to-video, and reference-image video
Authentication
- Env var: XAI_API_KEY
- Base URL: https://api.x.ai/v1
- Header: Authorization: Bearer $XAI_API_KEY
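A minimal sketch of how the three settings above combine in a Python client. The helper name build_headers is hypothetical, and the Content-Type header is an assumption (JSON request bodies), not something this skill states.

```python
import os

BASE_URL = "https://api.x.ai/v1"


def build_headers(env=os.environ):
    """Return request headers, reading the API key from XAI_API_KEY."""
    api_key = env.get("XAI_API_KEY")
    if not api_key:
        # Fail early instead of sending an unauthenticated request.
        raise RuntimeError("XAI_API_KEY is not set")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",  # assumed: JSON request bodies
    }
```

Passing a plain dict as env (for example build_headers({"XAI_API_KEY": "test-key"})) keeps the helper easy to exercise without touching the real environment.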
Image API
Text-to-image
- Endpoint: POST /images/generations
- Core fields: model, prompt, n, aspect_ratio, resolution
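As an illustration of the fields above, the following sketch builds a text-to-image request without sending it. The function name build_image_request is hypothetical, the Content-Type header is assumed, and the example value "16:9" for aspect_ratio is a guess at an accepted format rather than a documented one.

```python
import json
import urllib.request

BASE_URL = "https://api.x.ai/v1"


def build_image_request(prompt, api_key, n=1, aspect_ratio=None, resolution=None):
    """Build (but do not send) a POST /images/generations request."""
    body = {"model": "grok-imagine-image", "prompt": prompt, "n": n}
    # Optional fields are only included when the caller sets them.
    if aspect_ratio is not None:
        body["aspect_ratio"] = aspect_ratio
    if resolution is not None:
        body["resolution"] = resolution
    return urllib.request.Request(
        BASE_URL + "/images/generations",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",  # assumed
        },
        method="POST",
    )


req = build_image_request("a red fox in fresh snow", "test-key", aspect_ratio="16:9")
```

Handing req to urllib.request.urlopen would submit it. The response shape is not described in this skill, so parsing is left out of the sketch.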
Image edit
- Endpoint: POST /images/edits
- Use image for one source image
- Use images for multi-image compositing
- Each source image can be:
  - a public HTTPS URL
  - a base64 data URI
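The choice between image and images, and the data URI option, can be sketched as follows. The helper names are hypothetical, and the {"url": ...} wrapper for each source is an assumption borrowed from the video request shape shown later in this skill; the exact object shape for edits is not stated here.

```python
import base64


def data_uri(raw, mime="image/png"):
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_edit_body(prompt, sources):
    """Build a JSON body for POST /images/edits.

    sources is a list of HTTPS URLs and/or data URIs.
    """
    if not sources:
        raise ValueError("at least one source image is required")
    body = {"model": "grok-imagine-image", "prompt": prompt}
    if len(sources) == 1:
        # Single-image edit.
        body["image"] = {"url": sources[0]}  # assumed object shape
    else:
        # Multi-image compositing.
        body["images"] = [{"url": s} for s in sources]  # assumed object shape
    return body
```

A local file could be passed as data_uri(open(path, "rb").read()), with the mime argument adjusted to match the file type.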
Image prompting
- Grok responds well to direct natural language
- For edits, describe only the intended change and preserve everything else implicitly
- For multi-image merges, explicitly name how each source contributes
- Prefer one strong scene description over long style-stacking
Video API
Generation
- Endpoint: POST /videos/generations
- Polling endpoint: GET /videos/{request_id}
- Success state: status == "done"
- Failure states to handle explicitly: failed, expired
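The submit-then-poll flow above might look like the loop below. This is a sketch under stated assumptions: fetch_status is a hypothetical callable that performs GET /videos/{request_id} and returns the parsed JSON as a dict, the interval and timeout defaults are arbitrary, and any status other than the three named above is treated as still in progress.

```python
import time


class VideoGenerationError(RuntimeError):
    """Raised when the provider reports a terminal failure state."""


def poll_video(fetch_status, request_id, interval=5.0, timeout=600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll until status is "done"; raise on "failed", "expired", or timeout."""
    deadline = clock() + timeout
    while True:
        result = fetch_status(request_id)
        status = result.get("status")
        if status == "done":
            return result
        if status in ("failed", "expired"):
            raise VideoGenerationError(
                f"video request {request_id} ended with status {status!r}"
            )
        if clock() >= deadline:
            raise TimeoutError(f"video request {request_id} did not finish in time")
        sleep(interval)  # any other status is treated as still running
```

Injecting fetch_status, sleep, and clock keeps the loop testable without network access or real waiting.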
Modes
- Text-to-video:
  - prompt-only generation
- Image-to-video:
  - use image: {"url": ...}
  - this anchors the starting frame
- Reference-to-video:
  - use reference_images: [{"url": ...}, ...]
  - this influences who/what appears in the video without locking the first frame
  - prompts can reference inputs with placeholders like <IMAGE_1>, <IMAGE_2>
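The three modes differ only in which conditioning field accompanies the prompt, which the following sketch makes explicit. The function name build_video_body is hypothetical, and rejecting a request that sets both image and reference_images is a choice made for this sketch; the skill does not say whether the two can be combined.

```python
def build_video_body(prompt, image_url=None, reference_urls=None):
    """Build a JSON body for POST /videos/generations in one of three modes."""
    if image_url and reference_urls:
        # Assumption of this sketch: the modes are mutually exclusive.
        raise ValueError("use either image_url or reference_urls, not both")
    body = {"model": "grok-imagine-video", "prompt": prompt}
    if image_url:
        # Image-to-video: anchors the starting frame.
        body["image"] = {"url": image_url}
    elif reference_urls:
        # Reference-to-video: influences content without locking the first frame.
        body["reference_images"] = [{"url": u} for u in reference_urls]
    # Otherwise: text-to-video, prompt only.
    return body
```

For example, build_video_body("the person from <IMAGE_1> waves", reference_urls=[url]) yields a reference-to-video body, while omitting both optional arguments yields a text-to-video body.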
Video constraints
- Grok video is best treated as short-form generation
- Current output resolutions are 480p and 720p
- Reference-image video supports multiple images and is useful for product placement, wardrobe transfer, and identity consistency
- Download outputs promptly; provider URLs may be temporary
Pricing
- grok-imagine-image: $0.02 per generated image
- grok-imagine-image edits/composites: add $0.002 per input image
- grok-imagine-video:
  - 480p: $0.05 per second
  - 720p: $0.07 per second
- grok-imagine-video image-conditioned requests: add $0.002 per input image
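To make the arithmetic concrete, here is a small cost estimator built from the rates listed above. The function names are hypothetical, and the sketch assumes the input-image surcharge is charged once per input image per request. Decimal is used so the totals are exact.

```python
from decimal import Decimal

IMAGE_PRICE = Decimal("0.02")          # per generated image
INPUT_IMAGE_PRICE = Decimal("0.002")   # per input image
VIDEO_PRICE_PER_SECOND = {"480p": Decimal("0.05"), "720p": Decimal("0.07")}


def image_cost(n_outputs, n_inputs=0):
    """Estimated USD cost of one image generation or edit request."""
    return n_outputs * IMAGE_PRICE + n_inputs * INPUT_IMAGE_PRICE


def video_cost(seconds, resolution, n_inputs=0):
    """Estimated USD cost of one video request at the given resolution."""
    rate = VIDEO_PRICE_PER_SECOND[resolution]
    return seconds * rate + n_inputs * INPUT_IMAGE_PRICE


# Hypothetical example: an 8-second 720p clip with two reference images.
# 8 * 0.07 + 2 * 0.002 = 0.564
example = video_cost(8, "720p", n_inputs=2)
```

The 8-second duration is only an illustrative figure; the skill does not state supported clip lengths.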
Grok-Specific Prompt Guidance
Images
- Start with subject, action, setting
- Add one style anchor, not five
- For edits:
  - describe the desired modification
  - keep the rest of the image stable by omission, not by writing a giant preservation list
Video
- Keep prompts scene-local: one shot, one main motion idea, one emotional beat
- For reference-conditioned video, explicitly map source images to roles:
  - person from <IMAGE_1>
  - jacket from <IMAGE_2>
  - product from <IMAGE_3>
- Camera and pacing language helps:
  - slow push-in
  - handheld follow
  - locked-off medium shot
  - high-energy whip pan transition
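The role mapping and camera guidance above can be assembled mechanically, as in this sketch. The function name reference_prompt and the sentence template are illustrative only, and the sketch assumes placeholders are numbered from 1 in the same order as the reference_images list.

```python
def reference_prompt(scene, roles, camera=None):
    """Compose a one-shot video prompt that maps reference images to roles.

    roles lists role names in the same order as reference_images.
    """
    parts = [scene.rstrip(".") + "."]
    if roles:
        mapped = ", ".join(
            f"the {role} from <IMAGE_{i}>" for i, role in enumerate(roles, start=1)
        )
        parts.append(f"Use {mapped}.")
    if camera:
        parts.append(f"Camera: {camera}.")
    return " ".join(parts)


prompt = reference_prompt(
    "A model walks through a night market",
    ["person", "jacket"],
    camera="slow push-in",
)
# prompt == "A model walks through a night market. Use the person from <IMAGE_1>, "
#           "the jacket from <IMAGE_2>. Camera: slow push-in."
```

Keeping the scene to a single sentence matches the one-shot, one-motion guidance; the template itself is a convenience, not a required prompt format.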
Good Fits
- Image style transfer
- Image compositing from multiple sources
- Reference-conditioned short video
- Product-led motion clips
- Character-consistent scenes without hard first-frame lock
Weak Fits
- Long-form clip generation
- Heavy reliance on deterministic seeds
- Overloaded prompts with multiple scene changes
Failure Handling
- If generation submission succeeds but polling expires, surface it as a provider/runtime issue
- If a request fails, preserve the endpoint, mode, and prompt summary in the error
- Once xAI has been selected, do not switch to a different provider without user approval
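One way to carry the endpoint, mode, and prompt summary through an error is a dedicated exception type. The sketch below is illustrative: the class name GrokMediaError, the summarize helper, and the 80-character limit are assumptions, not part of OpenMontage.

```python
def summarize(prompt, limit=80):
    """Collapse whitespace and truncate the prompt for error messages."""
    text = " ".join(prompt.split())
    if len(text) > limit:
        text = text[: limit - 3] + "..."
    return text


class GrokMediaError(RuntimeError):
    """Error that preserves request context for later diagnosis."""

    def __init__(self, message, endpoint, mode, prompt):
        self.endpoint = endpoint
        self.mode = mode
        self.prompt_summary = summarize(prompt)
        super().__init__(
            f"{message} [endpoint={endpoint} mode={mode} "
            f"prompt={self.prompt_summary!r}]"
        )


err = GrokMediaError(
    "polling expired (provider/runtime issue)",
    endpoint="POST /videos/generations",
    mode="reference-to-video",
    prompt="the person from <IMAGE_1>   wearing the jacket from <IMAGE_2> " * 5,
)
```

Because the error names xAI's endpoint and mode explicitly, a caller can report the failure to the user and ask before trying any other provider.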