OpenMontage video-understand

install
source · Clone the upstream repo
git clone https://github.com/calesthio/OpenMontage
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/calesthio/OpenMontage "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/video-understand" ~/.claude/skills/calesthio-openmontage-video-understand-3ae2a4 && rm -rf "$T"
manifest: .claude/skills/video-understand/SKILL.md
source content

video-understand

Understand video content locally using ffmpeg for frame extraction and Whisper for transcription. Fully offline, no API keys required.

Prerequisites

  • ffmpeg
    +
    ffprobe
    (required):
    brew install ffmpeg
  • openai-whisper
    (optional, for transcription):
    pip install openai-whisper

Commands

# Scene detection + transcribe (default)
python3 skills/video-understand/scripts/understand_video.py video.mp4

# Keyframe extraction
python3 skills/video-understand/scripts/understand_video.py video.mp4 -m keyframe

# Regular interval extraction
python3 skills/video-understand/scripts/understand_video.py video.mp4 -m interval

# Limit frames extracted
python3 skills/video-understand/scripts/understand_video.py video.mp4 --max-frames 10

# Use a larger Whisper model
python3 skills/video-understand/scripts/understand_video.py video.mp4 --whisper-model small

# Frames only, skip transcription
python3 skills/video-understand/scripts/understand_video.py video.mp4 --no-transcribe

# Quiet mode (JSON only, no progress)
python3 skills/video-understand/scripts/understand_video.py video.mp4 -q

# Output to file
python3 skills/video-understand/scripts/understand_video.py video.mp4 -o result.json

CLI Options

FlagDescription
video
Input video file (positional, required)
-m, --mode
Extraction mode:
scene
(default),
keyframe
,
interval
--max-frames
Maximum frames to keep (default: 20)
--whisper-model
Whisper model size: tiny, base, small, medium, large (default: base)
--no-transcribe
Skip audio transcription, extract frames only
-o, --output
Write result JSON to file instead of stdout
-q, --quiet
Suppress progress messages, output only JSON

Extraction Modes

ModeHow it worksBest for
scene
Detects scene changes via ffmpeg
select='gt(scene,0.3)'
Most videos, varied content
keyframe
Extracts I-frames (codec keyframes)Encoded video with natural keyframe placement
interval
Evenly spaced frames based on duration and max-framesFixed sampling, predictable output

If

scene
mode detects no scene changes, it automatically falls back to
interval
mode.

Output

The script outputs JSON to stdout (or file with

-o
). See
references/output-format.md
for the full schema.

{
  "video": "video.mp4",
  "duration": 18.076,
  "resolution": {"width": 1224, "height": 1080},
  "mode": "scene",
  "frames": [
    {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00"}
  ],
  "frame_count": 12,
  "transcript": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
  ],
  "text": "Full transcript...",
  "note": "Use the Read tool to view frame images for visual understanding."
}

Use the Read tool on frame image paths to visually inspect extracted frames.

References

  • references/output-format.md
    -- Full JSON output schema documentation