Q-skills q-multimodal

Extract visual, video, and audio features from media. Use for pixel features (Pillow), video frames (FFmpeg+Pillow), audio features (openSMILE), and visual semantic analysis (Gemini API batch or standard).

install
source · Clone the upstream repo
git clone https://github.com/TyrealQ/q-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/TyrealQ/q-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/q-scholar/q-multimodal" ~/.claude/skills/tyrealq-q-skills-q-multimodal && rm -rf "$T"
manifest: skills/q-scholar/q-multimodal/SKILL.md
source content

Q-Multimodal

Multimodal media analysis: local low-level features (Pillow, openSMILE), mid/high-level visual semantic analysis (Gemini API). Local pipelines are fully generic and CLI-driven. Gemini pipelines are config-driven: copy

scripts/gemini/pipeline_config.py
to your project, customize, and run with
--config <path>
.

Setup (first time in a project)

Do this once when adopting the skill in a new project. The canonical layout is a target only for

scripts/
and
output/
— user assets (input data, media,
.env
, system prompt) can stay wherever they already live; point the scripts at them via absolute paths.

  • Identify
    <BASE_DIR>
    : read the project's CLAUDE.md if it exists. If
    BASE_DIR
    isn't defined, ask the user which directory is the project root.
  • Locate existing user assets — do not move them. Search, confirm each location with the user before proceeding:
    • Input dataset file (xlsx/csv/json/parquet)
    • Media directory (grouping structure — one subfolder per subject is ideal)
    • .env
      with
      GOOGLE_API_KEY1
      -
      4
      (Gemini only; check project root, home directory, common locations)
    • System prompt file (Gemini only)
  • Default: point at files in place. Set
    pipeline_config.py
    fields or CLI
    --input
    /
    --base-dir
    arguments to the absolute paths you found. Never move user data without explicit confirmation.
  • Materialize only
    scripts/
    and
    output/
    under
    <BASE_DIR>
    . Copy the pipelines actually being used from
    ${SKILL_DIR}/scripts/
    into
    <BASE_DIR>/scripts/
    :
    • Local pipelines:
      pillow/
      ,
      opensmile/
      ,
      common.py
    • Gemini pipelines:
      gemini/batch/
      ,
      gemini/standard/
      ,
      gemini/pipeline_config.py
      (template → adapt in place or copy to
      <BASE_DIR>/scripts/pipeline_config.py
      )
    • output/
      is auto-created by scripts on first run
  • Scan input columns (adapt reader to file format):
    python -c "import pandas as pd; print(list(pd.read_<FORMAT>('INPUT', nrows=1).columns))"
  • Confirm
    --id-cols
    with the user. By default only the identifier column (from
    --file-col
    ) is kept; the user may want additional columns carried through to the output.

References

Read the relevant reference file before executing a pipeline. These contain all flags, output column definitions, edge cases, and validation rules.

Local pipelines:

  • references/image-visual-features.md
    — all feature categories, column definitions, computation notes
  • references/video-visual-features.md
    — frame extraction, aggregation logic, dual output format
  • references/audio-features.md
    — openSMILE feature sets, interpretable scores, feature levels

Gemini pipelines:

  • references/gemini-batch-workflow.md
    — full 6-step batch pipeline, retry workflow, error handling
  • references/gemini-standard.md
    — standard pipeline details, model config, adapting for new projects
  • references/multi-key-management.md
    — multi-key quota strategy, retry threshold decision table

Shared:

  • references/checkpoint-format.md
    — column order, validation rules, output directory structure

Dependencies

PipelinePython packagesSystem
Image visual
Pillow
,
numpy
,
pandas
,
tqdm
,
openpyxl
Video visual(same as image) +
scenedetect[opencv]
ffmpeg
on PATH (for
--extractor ffmpeg
)
Audio
opensmile
,
pandas
,
tqdm
,
openpyxl
ffmpeg
on PATH
Gemini
google-genai
,
python-dotenv
(+ above)
.env
with
GOOGLE_API_KEY1
-
4

Pipelines

Script path =

${SKILL_DIR}/scripts/<path>
. Read the pipeline's reference file before running.

Local Pipelines (generic, CLI-driven)

ScriptInputOutputReference
pillow/visual_features.py
Images47 pixel features (color, texture, spatial, quality)
image-visual-features.md
pillow/video_features.py
VideosFrame-level + video-level aggregated features (scene-based extraction by default, FFmpeg fixed-interval optional)
video-visual-features.md
opensmile/audio_features.py
Video/audio8 interpretable scores + raw openSMILE features
audio-features.md

Shared utilities:

common.py
read_input()
,
save_excel()
,
derive_subject()
,
merge_checkpoints()

Command pattern:

python <script> --input <file> --base-dir <root> [--features ...] [--id-cols ...] [--subjects ...] [--preview] [--merge]

Gemini Pipelines (config-driven)

Both pipelines read a

pipeline_config.py
file that defines paths, schema, metadata formatting, and validation rules. Copy
scripts/gemini/pipeline_config.py
to your project and customize.

Standard (

gemini/standard/gemini_standard.py
): inline media, 25 workers, auto-retry. See
gemini-standard.md
.

Batch (

gemini/batch/[0-5]*.py
+
utils.py
): 6-step pipeline, 50% discount. URIs expire after 48 hours. See
gemini-batch-workflow.md
.

python 0uploadMedia.py --config /path/to/config.py --submit --max-batch-gb 2 --key 1
python 3checkStatus.py --config /path/to/config.py --poll
python 4retryErrors.py --config /path/to/config.py --preview
# >500 failures: batch retry
python 4retryErrors.py --config /path/to/config.py --submit
python 3checkStatus.py --config /path/to/config.py --poll
python 4retryErrors.py --config /path/to/config.py --collect
# <=500 failures or after batch retries: live fallback
python 4retryErrors.py --config /path/to/config.py --standard
python 5review.py --config /path/to/config.py --merge

Decision: >5 GB or >10 subjects and not time-sensitive → batch. Otherwise → standard. See

gemini-standard.md
.

Multi-key: Each

GOOGLE_API_KEY{N}
= 20 GB quota. See
multi-key-management.md
.

Adapting for New Projects

Local pipelines (Pillow, openSMILE): No modification needed. All project-specific values come from CLI args.

Gemini pipelines: Config-driven, no script modification needed. Scripts are copied to the project in step 2 above, then:

  • Adapt
    <BASE_DIR>/scripts/pipeline_config.py
    (already copied from template in step 2)
  • Set
    BASE_DIR
    ,
    INPUT_PATH
    ,
    SYSTEM_PROMPT_PATH
    to your project paths (
    SYSTEM_PROMPT_PATH
    relative to
    BASE_DIR
    , e.g.,
    scripts/<prompt>.txt
    )
  • Set
    GROUP_COL
    ,
    FILE_COL
    ,
    ANALYSIS_FIELDS
    to match your input schema and system prompt
  • Implement
    subject_id()
    and
    format_metadata()
    for your domain
  • Optionally implement
    validate_row()
    for field-specific validation rules
  • Run any script with
    --config <BASE_DIR>/scripts/pipeline_config.py

Scope

Include: Image/video/audio feature extraction, Gemini visual semantic analysis, batch job management, checkpoint merging, multi-key quota management.

Exclude: ML model training, deep learning inference, real-time streaming analysis.

Checklist

  • Read project CLAUDE.md for paths and column names
  • Confirm
    --id-cols
    and
    --features
    with user
  • --preview
    dry run confirms expected subjects and counts
  • Extraction completed with 0 or acceptable failures
  • For Gemini:
    .env
    with API keys, system prompt file created