git clone https://github.com/ZkAGI/zkagi-video-engine
T=$(mktemp -d) && git clone --depth=1 https://github.com/ZkAGI/zkagi-video-engine "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude" ~/.claude/skills/zkagi-zkagi-video-engine-claude && rm -rf "$T"
.claude/LTX2-SKILL.md- makes HTTP requests (curl)
LTX-2.3 Video Generation Skill — ComfyUI Integration
Overview
LTX-2.3 is a 22B parameter audio-video foundation model by Lightricks. It generates synchronized video AND audio in a single pass. On this system it runs on an RTX 5090 (32GB VRAM) via ComfyUI Desktop and uses DualCLIPLoader for its dual text encoder architecture.
Connection:
COMFY_URL="http://$(ip route show default | awk '{print $3}'):8001"
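The shell one-liner above takes the third field of the default route (the gateway address) and appends port 8001. A minimal Python sketch of the same idea follows, for cases where the URL has to be built inside a script. The function name `comfy_url_from_route` and the sample route text are illustrative assumptions, not part of the original skill.

```python
import re


def comfy_url_from_route(route_output: str, port: int = 8001) -> str:
    """Build the ComfyUI base URL from the text printed by `ip route show default`.

    Hypothetical helper: it mirrors `awk '{print $3}'` by taking the token
    that follows "default via".
    """
    match = re.search(r"^default\s+via\s+(\S+)", route_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no default route found in the given text")
    return f"http://{match.group(1)}:{port}"


# Made-up sample output; the real gateway address differs per machine.
sample = "default via 172.22.96.1 dev eth0 proto kernel\n"
print(comfy_url_from_route(sample))
```

In practice the input text would come from running `ip route show default`, for example through `subprocess.run`.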
Capabilities
- Text-to-video — generate video from text prompt
- Image-to-video — animate a reference image (BEST QUALITY)
- Audio-video sync — generates matching audio with video
- Video extension — extend existing clips
- Spatial upscale — 2x resolution upscale with dedicated model
- Distilled fast mode — 8-step generation (vs 20-step normal)
- LoRA support — style/motion/likeness LoRAs
- Up to 4K resolution, 50 FPS, 20 seconds per clip
Models Available on This System
- checkpoints/ → ltx-2.3-22b-dev-fp8.safetensors
- text_encoders/ → gemma_3_12B_it.safetensors (reused from LTX-2)
- text_encoders/ → ltx-2.3_text_projection_bf16.safetensors (new text projection)
- latent_upscale/ → ltx-2-spatial-upscaler-x2-1.0.safetensors (check if available)
- loras/ → ltx-2.3-distilled-lora-384.safetensors (check if available)
Backward compatibility: Old LTX-2 (19B) files remain on the system as fallback (ltx-2-19b-dev-fp8.safetensors, ltx-2-19b-distilled-lora-384.safetensors).
Always verify at runtime:
curl -s "$COMFY_URL/object_info/DualCLIPLoader" | python3 -c "
import sys,json
d=json.load(sys.stdin)['DualCLIPLoader']['input']['required']
print('Clip name 1:', d.get('clip_name1', ['?'])[0])
print('Clip name 2:', d.get('clip_name2', ['?'])[0])
print('Type:', d.get('type', ['?'])[0])
"
Recommended Settings
Resolution (BEFORE upscale — actual output is 2x)
| Aspect | Width | Height | Output after 2x |
|---|---|---|---|
| 3:2 | 768 | 512 | 1536×1024 |
| 16:9 | 768 | 432 | 1536×864 |
| 1:1 | 640 | 640 | 1280×1280 |
| 9:16 | 432 | 768 | 864×1536 |
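Width & height must be divisible by 32. Note that 432 is not a multiple of 32 (the nearest multiples are 416 and 448), so confirm that the 16:9 and 9:16 presets are accepted before relying on them.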
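The divisibility rule can be checked before a workflow is submitted. The sketch below is a hypothetical helper, not part of the original skill: it reports whether a width/height pair satisfies the rule and suggests the nearest multiples of 32.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round value to the nearest positive multiple (ties round up)."""
    snapped = ((value + multiple // 2) // multiple) * multiple
    return max(multiple, snapped)


def check_resolution(width: int, height: int, multiple: int = 32):
    """Return (is_valid, suggested_width, suggested_height)."""
    is_valid = width % multiple == 0 and height % multiple == 0
    return is_valid, snap_to_multiple(width, multiple), snap_to_multiple(height, multiple)


print(check_resolution(768, 512))  # (True, 768, 512)
print(check_resolution(768, 432))  # (False, 768, 448)
```

Snapping changes the aspect ratio slightly, so the suggestion is a starting point rather than a guaranteed match for the intended framing.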
Frames & Duration
| Frames | Duration @25fps | Use case |
|---|---|---|
| 97 | 3.88s | Quick cuts, transitions |
| 121 | 4.84s | Standard scene clip |
| 161 | 6.44s | Extended scene |
| 201 | 8.04s | Long scene |
| 257 | 10.28s | Maximum length (highest VRAM) |
Frame count must be 8n+1 (97, 105, 113, 121, 129, ... 257).
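To pick a frame count for a target duration, round up to the next value of the form 8n+1. The following is a minimal sketch under the constraints stated above (25 FPS default, 257-frame ceiling); the function names are illustrative.

```python
import math


def is_valid_frame_count(frames: int) -> bool:
    """True when frames has the form 8n+1 with n >= 1."""
    return frames >= 9 and (frames - 1) % 8 == 0


def frames_for_duration(seconds: float, fps: int = 25, max_frames: int = 257) -> int:
    """Smallest 8n+1 frame count that covers `seconds`, capped at max_frames."""
    # round() guards against floating-point noise such as 120.99999999 frames
    needed = math.ceil(round(seconds * fps, 6))
    n = max(1, math.ceil((needed - 1) / 8))
    return min(8 * n + 1, max_frames)


print(frames_for_duration(5.0))   # 129
print(frames_for_duration(12.0))  # 257 (capped; one clip cannot cover 12s at 25 FPS)
```

When the result is capped, the requested duration is not covered by a single clip and the scene needs more than one clip.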
Quality Settings
| Mode | Steps | CFG | LoRA | Speed | Quality |
|---|---|---|---|---|---|
| Full quality | 20 | 4.0 | none | Slow (~3min) | Best |
| Distilled fast | 8 | 1.0 | distilled-lora | Fast (~1min) | Great |
| Quick preview | 4 | 1.0 | distilled-lora | Very fast | Good |
Recommendation: Use distilled-lora (8 steps) for ALL production use. Quality is comparable to 20 steps and much more stable.
FPS Guidelines
- Motion-heavy (action, transitions, particles): 25-30 FPS
- Static/close-up (face, object, slow pan): 15-24 FPS
- Default: 25 FPS
PROMPTING GUIDE (CRITICAL)
LTX-2.3 responds to DESCRIPTIVE, NOVEL-LIKE prompts, not keywords.
Structure of a GOOD prompt:
[SCENE SETUP] + [SUBJECT DESCRIPTION] + [CAMERA MOVEMENT] + [MOTION/ACTION] + [ATMOSPHERE/MOOD] + [AUDIO DESCRIPTION]
Examples:
For a crypto/tech scene: "A close-up shot of a glowing digital vault slowly opening, revealing streams of golden light pouring out. The camera pushes forward through the vault door as holographic numbers and symbols float past. Particles of light drift upward like fireflies. The atmosphere is futuristic and mysterious, with deep blue and purple tones. A deep electronic hum builds as the vault opens, followed by a crystalline chime."
For an action/reveal scene: "A dramatic wide shot of digital chains shattering in slow motion, each link exploding into fragments of light. The camera pulls back to reveal a figure standing free, surrounded by floating shards that catch the light. Sparks cascade downward. The mood shifts from dark and oppressive to bright and liberating. The sound of breaking glass echoes, followed by a rising orchestral swell."
For a peaceful/trust scene: "A slow, sweeping aerial shot over a serene digital landscape, with rolling hills made of soft gradients and trees rendered in gentle geometric shapes. Soft clouds drift across the scene. The camera glides forward smoothly, descending toward a glowing structure in the distance. Warm golden light bathes everything. Gentle ambient music plays with soft piano notes and nature sounds."
BAD prompt habits (AVOID):
- ❌ "crypto wallet security" — too vague, keyword-like
- ❌ "a shield" — no motion, no camera, no atmosphere
- ❌ "neon city" — static description
- ❌ Short prompts under 20 words — LTX-2.3 needs detail
Motion keywords to weave into prompts:
- Camera: "slowly pushing forward", "dolly shot", "orbiting around", "pulling back to reveal", "tracking shot following", "pan left", "crane up", "close-up transitioning to wide"
- Elements: "particles drifting", "light rays sweeping", "energy pulsing", "waves rippling", "fragments floating", "sparks cascading"
- Transitions: "shifting from dark to bright", "focus pulling", "morphing into", "dissolving away"
Audio in prompts (LTX-2.3 generates audio too!):
Include audio descriptions at the end of prompts:
- "The sound of digital beeps and a low electronic hum"
- "A rising orchestral swell with deep bass"
- "Gentle ambient music with soft chimes"
- "The crackle of energy and a dramatic boom"
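The six-part prompt structure and the 20-word minimum can be enforced in code so that keyword-style prompts are caught before submission. This is a sketch only: the `ScenePrompt` class and `too_short` helper are hypothetical names, and the sample text is adapted from the crypto/tech example above.

```python
from dataclasses import dataclass


@dataclass
class ScenePrompt:
    scene: str
    subject: str
    camera: str
    motion: str
    atmosphere: str
    audio: str  # kept last, as the guide recommends

    def build(self) -> str:
        """Join the non-empty parts into one paragraph, one sentence per part."""
        parts = [self.scene, self.subject, self.camera,
                 self.motion, self.atmosphere, self.audio]
        sentences = [p.strip().rstrip(".") + "." for p in parts if p.strip()]
        return " ".join(sentences)


def too_short(prompt: str, min_words: int = 20) -> bool:
    """Flag prompts under the 20-word minimum noted in the bad-habits list."""
    return len(prompt.split()) < min_words


prompt = ScenePrompt(
    scene="A close-up shot of a glowing digital vault slowly opening",
    subject="Streams of golden light pour out of the vault",
    camera="The camera pushes forward through the vault door",
    motion="Particles of light drift upward like fireflies",
    atmosphere="The atmosphere is futuristic and mysterious, with deep blue and purple tones",
    audio="A deep electronic hum builds as the vault opens",
).build()

print(too_short(prompt))       # False
print(too_short("neon city"))  # True
```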
PRODUCTION WORKFLOW: Image-to-Video with Hires.fix
This is the OPTIMAL pipeline. 3 stages:
Stage 1: Generate base video from reference image
import random

# Placeholder seed; replace with a fixed integer for reproducible output
RANDOM_SEED = random.randint(0, 2**32 - 1)

workflow = {
    # Load checkpoint (model + VAE)
    "1": {
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "ltx-2.3-22b-dev-fp8.safetensors"}
    },
    # Load dual text encoders via DualCLIPLoader
    "2": {
        "class_type": "DualCLIPLoader",
        "inputs": {
            "clip_name1": "gemma_3_12B_it.safetensors",
            "clip_name2": "ltx-2.3_text_projection_bf16.safetensors",
            "type": "ltx"
        }
    },
    # Load + preprocess reference image
    "3": {"class_type": "LoadImage", "inputs": {"image": "UPLOADED_IMAGE_NAME.png"}},
    "4": {"class_type": "LTXVPreprocess", "inputs": {"image": ["3", 0]}},
    # Positive prompt (MOTION-FOCUSED)
    "5": {
        "class_type": "CLIPTextEncode",
        "inputs": {
            "text": "NOVEL-LIKE MOTION PROMPT WITH AUDIO DESCRIPTION",
            "clip": ["2", 0]
        }
    },
    # Negative prompt
    "6": {
        "class_type": "CLIPTextEncode",
        "inputs": {
            "text": "static, frozen, no motion, blurry, low quality, distorted, text, watermark, jittery, flickering, ugly, deformed",
            "clip": ["2", 0]
        }
    },
    # LTX conditioning with frame rate
    "7": {
        "class_type": "LTXVConditioning",
        "inputs": {"positive": ["5", 0], "negative": ["6", 0], "frame_rate": 25}
    },
    # Empty video latent
    "8": {
        "class_type": "EmptyLTXVLatentVideo",
        "inputs": {
            "width": 768,    # Half of final output (gets 2x upscaled)
            "height": 512,
            "length": 121,   # 4.84s at 25fps; adjust per scene
            "batch_size": 1
        }
    },
    # Empty audio latent
    "9": {
        "class_type": "LTXVEmptyLatentAudio",
        "inputs": {"length": 121}  # Match video length
    },
    # Combine video + audio latents
    "10": {
        "class_type": "LTXVConcatAVLatent",
        "inputs": {
            "a": ["8", 0],  # video latent
            "b": ["9", 0]   # audio latent
        }
    },
    # Insert reference image as first frame
    "11": {
        "class_type": "LTXVImgToVideo",
        "inputs": {
            "positive": ["7", 0],
            "negative": ["7", 1],
            "vae": ["1", 2],
            "image": ["4", 0],  # preprocessed image
            "width": 768,
            "height": 512,
            "length": 121,
            "batch_size": 1
        }
    },
    # LTX scheduler
    "12": {
        "class_type": "LTXVScheduler",
        "inputs": {
            "steps": 20,
            "max_shift": 2.05,
            "base_shift": 0.95,
            "stretch": True,
            "terminal": 0.1,
            "latent": ["11", 0]
        }
    },
    # Noise
    "13": {"class_type": "RandomNoise", "inputs": {"noise_seed": RANDOM_SEED}},
    # Guider
    "14": {
        "class_type": "CFGGuider",
        "inputs": {
            "model": ["1", 0],
            "positive": ["7", 0],
            "negative": ["7", 1],
            "cfg": 4.0
        }
    },
    # Sampler
    "15": {"class_type": "KSamplerSelect", "inputs": {"sampler_name": "euler"}},
    # Sample
    "16": {
        "class_type": "SamplerCustomAdvanced",
        "inputs": {
            "noise": ["13", 0],
            "guider": ["14", 0],
            "sampler": ["15", 0],
            "sigmas": ["12", 0],
            "latent_image": ["11", 0]
        }
    },

    # STAGE 2: HIRES.FIX, spatial upscale 2x (if upscaler model available)
    # Check: curl -s "$COMFY_URL/object_info/LTXVLatentUpsampler"
    # "17": {
    #     "class_type": "LTXVLatentUpsampler",
    #     "inputs": {
    #         "upscale_model": "ltx-2-spatial-upscaler-x2-1.0.safetensors",
    #         "samples": ["16", 0]  # output from sampler
    #     }
    # },

    # STAGE 3: DECODE, separate and decode video + audio
    "17": {
        "class_type": "LTXVSeparateAVLatent",
        "inputs": {"samples": ["16", 0]}
    },
    "18": {
        "class_type": "VAEDecode",
        "inputs": {
            "samples": ["17", 0],  # video latent
            "vae": ["1", 2]
        }
    },
    # Save video with audio
    "19": {
        "class_type": "SaveVideo",
        "inputs": {
            "filename_prefix": "scene_INDEX",
            "images": ["18", 0],
            "fps": 25
        }
    }
}
Using Distilled LoRA for Speed (RECOMMENDED)
If ltx-2.3-distilled-lora-384.safetensors is available:
- Add LoRA loader node after model load
- Change steps from 20 → 8
- Change CFG from 4.0 → 1.0
- Change scheduler to "Simple" (or keep LTXVScheduler but adjust)
- This produces FASTER generation with comparable quality
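The changes listed above can be applied to a workflow dict programmatically. The sketch below is one possible way to do it, with several assumptions: it uses ComfyUI's `LoraLoaderModelOnly` node, the node ids "1", "12" and "14" follow the Stage 1 workflow, the new node id "30" and the LoRA strength of 1.0 are arbitrary choices, and the scheduler type is left unchanged. Confirm that the node is available through `/object_info` before relying on it.

```python
import copy


def apply_distilled_lora(workflow: dict,
                         lora_name: str = "ltx-2.3-distilled-lora-384.safetensors",
                         model_node: str = "1",
                         scheduler_node: str = "12",
                         guider_node: str = "14",
                         lora_node: str = "30",   # hypothetical free node id
                         steps: int = 8,
                         cfg: float = 1.0) -> dict:
    """Return a copy of the workflow switched to distilled fast mode."""
    wf = copy.deepcopy(workflow)
    if lora_node in wf:
        raise ValueError(f"node id {lora_node!r} is already in use")
    # Insert the LoRA loader directly after the checkpoint's model output.
    wf[lora_node] = {
        "class_type": "LoraLoaderModelOnly",
        "inputs": {
            "model": [model_node, 0],
            "lora_name": lora_name,
            "strength_model": 1.0,  # assumed strength
        },
    }
    wf[scheduler_node]["inputs"]["steps"] = steps
    wf[guider_node]["inputs"]["cfg"] = cfg
    # Route the guider through the LoRA-patched model instead of the raw one.
    wf[guider_node]["inputs"]["model"] = [lora_node, 0]
    return wf


# Reduced stand-in for the Stage 1 workflow, just enough to show the edits.
base = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "ltx-2.3-22b-dev-fp8.safetensors"}},
    "12": {"class_type": "LTXVScheduler", "inputs": {"steps": 20}},
    "14": {"class_type": "CFGGuider", "inputs": {"model": ["1", 0], "cfg": 4.0}},
}
fast = apply_distilled_lora(base)
print(fast["12"]["inputs"]["steps"], fast["14"]["inputs"]["cfg"])  # 8 1.0
```

The function copies the input, so the full-quality workflow stays available for a final render.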
Audio Decode (if LTXVAudioVAELoader + LTXVAudioVAEDecode available)
# Add these nodes to also extract generated audio
"20": {
    "class_type": "LTXVAudioVAELoader",
    "inputs": {}
},
"21": {
    "class_type": "LTXVAudioVAEDecode",
    "inputs": {
        "samples": ["17", 1],  # audio latent from SeparateAVLatent
        "vae": ["20", 0]
    }
}
# The generated audio includes scene-appropriate sounds!
SUBMITTING & POLLING WORKFLOWS
Submit
PROMPT_ID=$(curl -s -X POST "$COMFY_URL/prompt" \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $WORKFLOW_JSON}" | python3 -c "import sys,json; print(json.load(sys.stdin)['prompt_id'])")
echo "Submitted: $PROMPT_ID"
Poll for completion
while true; do
  STATUS=$(curl -s "$COMFY_URL/history/$PROMPT_ID" | python3 -c "
import sys,json
d=json.load(sys.stdin)
if '$PROMPT_ID' in d:
    s = d['$PROMPT_ID'].get('status',{}).get('status_str','')
    outputs = d['$PROMPT_ID'].get('outputs', {})
    if s == 'error': print('ERROR')
    elif outputs: print('DONE')
    else: print('RUNNING')
else: print('WAITING')
")
  echo "Status: $STATUS"
  if [ "$STATUS" = "DONE" ] || [ "$STATUS" = "ERROR" ]; then break; fi
  sleep 3
done
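The status logic embedded in the polling loop can also be written as a standalone Python function, which makes it testable without a running server. This is a sketch: the function name `classify_history` is hypothetical, and the sample payloads are reduced, made-up shapes containing only the fields the loop reads.

```python
def classify_history(history: dict, prompt_id: str) -> str:
    """Map a /history/<prompt_id> response to WAITING, RUNNING, DONE or ERROR."""
    entry = history.get(prompt_id)
    if entry is None:
        return "WAITING"  # the prompt has not reached the history yet
    status = entry.get("status", {}).get("status_str", "")
    if status == "error":
        return "ERROR"
    if entry.get("outputs"):
        return "DONE"
    return "RUNNING"


# Made-up payloads for illustration only.
print(classify_history({}, "abc"))                                           # WAITING
print(classify_history({"abc": {"status": {"status_str": "error"}}}, "abc"))  # ERROR
print(classify_history({"abc": {"outputs": {"19": {"images": []}}}}, "abc"))  # DONE
```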
Download output
FILENAME=$(curl -s "$COMFY_URL/history/$PROMPT_ID" | python3 -c "
import sys,json
d=json.load(sys.stdin)['$PROMPT_ID']['outputs']
for nid, out in d.items():
    for key in ['gifs','videos','images']:
        if key in out:
            for item in out[key]:
                fn = item.get('filename','')
                if fn.endswith('.mp4') or fn.endswith('.webm'):
                    print(fn); exit()
")
curl -s "$COMFY_URL/view?filename=$FILENAME&type=output" --output public/scenes/scene-{i}-a.mp4
ffprobe -v error -show_entries format=duration -of csv=p=0 public/scenes/scene-{i}-a.mp4
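The filename lookup above can be expressed as a small Python helper, together with a URL builder that escapes the filename (the shell version passes it through unescaped). This is a sketch under assumptions: the helper names are hypothetical and the sample outputs dict is a made-up, reduced shape.

```python
from urllib.parse import urlencode

VIDEO_EXTENSIONS = (".mp4", ".webm")


def find_video_filename(outputs: dict):
    """Return the first .mp4/.webm filename in a history 'outputs' dict, or None."""
    for node_output in outputs.values():
        for key in ("gifs", "videos", "images"):
            for item in node_output.get(key, []):
                name = item.get("filename", "")
                if name.endswith(VIDEO_EXTENSIONS):
                    return name
    return None


def view_url(base_url: str, filename: str, file_type: str = "output") -> str:
    """Build the /view download URL with the query string properly escaped."""
    return f"{base_url}/view?" + urlencode({"filename": filename, "type": file_type})


# Made-up outputs for illustration only.
outputs = {"19": {"images": [{"filename": "scene_0_00001.mp4", "type": "output"}]}}
name = find_video_filename(outputs)
print(view_url("http://127.0.0.1:8001", name))
```

Returning `None` when no video is present lets the caller decide whether to retry or fail, instead of passing an empty filename to the download step.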
FILLING SCENE DURATION (CRITICAL)
Each LTX-2.3 clip is 4-10 seconds. TTS audio is 8-15 seconds per scene.
Generate MULTIPLE clips per scene with DIFFERENT motion prompts:
Example: 12-second scene about "wallet security"
- Sub-clip A (0-5s), LTX-2.3 image-to-video: "camera pushing into glowing vault, particles rising, dramatic reveal, electronic hum building"
- Sub-clip B (5-9s), LTX-2.3 image-to-video: "keys splitting into glowing fragments orbiting around center, camera slowly orbiting, crystalline chimes"
- Sub-clip C (9-12s), LTX-2.3 image-to-video: "shield materializing with rippling energy, camera pulling back, triumphant orchestral swell"
Rules:
- NEVER loop a single clip
- NEVER use static images with Ken Burns as filler — generate MORE video clips
- Each sub-clip gets a DIFFERENT motion prompt (different camera angle, different action)
- Generate the reference image for each sub-clip from the image gen API with a different angle/moment
- Crossfade between sub-clips in Remotion (10-15 frames)
- The total duration of all sub-clips must be >= the audio duration
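The coverage rule can be turned into a small planner that decides how many sub-clips a scene needs and how long each should be. The sketch below makes several assumptions: clips are equal in length, each is capped at 161 frames (the upper end of the recommended 121-161 range), frame counts follow the 8n+1 rule, and frames lost to crossfade overlap are ignored. The function name is hypothetical.

```python
import math


def plan_subclips(audio_seconds: float, fps: int = 25, max_clip_frames: int = 161):
    """Return a list of 8n+1 frame counts whose total covers the audio duration.

    max_clip_frames should itself have the form 8n+1 so the cap is respected.
    """
    # round() guards against floating-point noise before taking the ceiling
    total_frames = math.ceil(round(audio_seconds * fps, 6))
    clip_count = max(1, math.ceil(total_frames / max_clip_frames))
    per_clip = math.ceil(total_frames / clip_count)
    n = max(1, math.ceil((per_clip - 1) / 8))
    return [8 * n + 1] * clip_count


plan = plan_subclips(12.0)
print(plan)             # [153, 153]
print(sum(plan) / 25)   # 12.24
```

If the crossfades overlap adjacent clips, add the overlap to `audio_seconds` before planning so the visible timeline still covers the audio.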
File naming:
public/scenes/
├── scene-0-a.mp4 ← first clip
├── scene-0-b.mp4 ← second clip
├── scene-0-c.mp4 ← third clip (if needed)
├── scene-1-a.mp4
├── scene-1-b.mp4
└── ...
VIDEO QUALITY TIPS
- Image-to-video >> Text-to-video — Always provide a reference image for this
- LTXVPreprocess the image — Intentionally degrades to look like video compression, preventing quality mismatch between input image and generated frames
- Use Hires.fix — 2x spatial upscale with LTXVLatentUpsampler if available
- Distilled LoRA — Use 8-step distilled for stable, fast results
- Longer prompts = better — Describe like a novel, not keywords
- Include audio in prompt — LTX-2.3 generates matching sound effects
- Frame count 121-161 — Best quality/speed balance on 32GB VRAM
- Width/height divisible by 32 — Required by model architecture
- Frame count = 8n+1 — Required (97, 105, 113, 121, 129, 137, 145, 153, 161, 169, ... 257)
- Motion LoRAs — If video barely moves, a motion LoRA helps (check custom_nodes for IC-LoRA)