Q-skills q-multimodal
Extract visual, video, and audio features from media. Use for pixel features (Pillow), video frames (FFmpeg+Pillow), audio features (openSMILE), and visual semantic analysis (Gemini API batch or standard).
git clone https://github.com/TyrealQ/q-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/TyrealQ/q-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/q-scholar/q-multimodal" ~/.claude/skills/tyrealq-q-skills-q-multimodal && rm -rf "$T"
skills/q-scholar/q-multimodal/SKILL.md
Q-Multimodal
Multimodal media analysis: local low-level features (Pillow, openSMILE) and mid/high-level visual semantic analysis (Gemini API). Local pipelines are fully generic and CLI-driven. Gemini pipelines are config-driven: copy scripts/gemini/pipeline_config.py to your project, customize it, and run with --config <path>.
Setup (first time in a project)
Do this once when adopting the skill in a new project. The canonical layout is a target only for scripts/ and output/. User assets (input data, media, .env, system prompt) can stay wherever they already live; point the scripts at them via absolute paths.
- Identify <BASE_DIR>: read the project's CLAUDE.md if it exists. If BASE_DIR isn't defined, ask the user which directory is the project root.
- Locate existing user assets; do not move them. Search, then confirm each location with the user before proceeding:
  - Input dataset file (xlsx/csv/json/parquet)
  - Media directory (grouping structure; one subfolder per subject is ideal)
  - .env with GOOGLE_API_KEY1-4 (Gemini only; check project root, home directory, common locations)
  - System prompt file (Gemini only)
- Default: point at files in place. Set pipeline_config.py fields or CLI --input/--base-dir arguments to the absolute paths you found. Never move user data without explicit confirmation.
- Materialize only scripts/ and output/ under <BASE_DIR>. Copy the pipelines actually being used from ${SKILL_DIR}/scripts/ into <BASE_DIR>/scripts/:
  - Local pipelines: pillow/, opensmile/, common.py
  - Gemini pipelines: gemini/batch/, gemini/standard/, gemini/pipeline_config.py (template → adapt in place or copy to <BASE_DIR>/scripts/pipeline_config.py)
  - output/ is auto-created by scripts on first run
- Scan input columns (adapt reader to file format):
  python -c "import pandas as pd; print(list(pd.read_<FORMAT>('INPUT', nrows=1).columns))"
- Confirm --id-cols with the user. By default only the identifier column (from --file-col) is kept; the user may want additional columns carried through to the output.
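The .env search in the setup steps above could be sketched roughly as follows. This is only an illustration: the function name find_gemini_keys and the list of candidate directories are hypothetical, and the only detail taken from this document is the GOOGLE_API_KEY<N> naming. The sketch reports which key names are defined without ever printing their values.

```python
import re
from pathlib import Path

# Matches lines such as "GOOGLE_API_KEY1=..." or "export GOOGLE_API_KEY2 = ...".
# Only the variable name is captured; the secret value is never stored.
KEY_PATTERN = re.compile(
    r"^[ \t]*(?:export[ \t]+)?(GOOGLE_API_KEY\d+)[ \t]*=[ \t]*\S",
    re.MULTILINE,
)


def find_gemini_keys(candidate_dirs):
    """Return {env_path: [key names]} for each .env defining GOOGLE_API_KEY<N>.

    Hypothetical helper: candidate_dirs is whatever list of directories
    you decide to check (project root, home directory, and so on).
    """
    found = {}
    for directory in candidate_dirs:
        env_path = Path(directory) / ".env"
        if not env_path.is_file():
            continue
        text = env_path.read_text(encoding="utf-8", errors="replace")
        names = sorted(set(KEY_PATTERN.findall(text)))
        if names:
            found[str(env_path)] = names
    return found
```

A call such as find_gemini_keys([Path.cwd(), Path.home()]) gives a short list of locations to confirm with the user before any path is written into the config.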
References
Read the relevant reference file before executing a pipeline. These contain all flags, output column definitions, edge cases, and validation rules.
Local pipelines:
references/image-visual-features.md: all feature categories, column definitions, computation notes
references/video-visual-features.md: frame extraction, aggregation logic, dual output format
references/audio-features.md: openSMILE feature sets, interpretable scores, feature levels
Gemini pipelines:
references/gemini-batch-workflow.md: full 6-step batch pipeline, retry workflow, error handling
references/gemini-standard.md: standard pipeline details, model config, adapting for new projects
references/multi-key-management.md: multi-key quota strategy, retry threshold decision table
Shared:
references/checkpoint-format.md: column order, validation rules, output directory structure
Dependencies
| Pipeline | Python packages | System |
|---|---|---|
| Image visual | , , , , | — |
| Video visual | (same as image) + | on PATH (for ) |
| Audio | , , , | on PATH |
| Gemini | , (+ above) | .env with GOOGLE_API_KEY1-4 |
Pipelines
Script path = ${SKILL_DIR}/scripts/<path>. Read the pipeline's reference file before running.
Local Pipelines (generic, CLI-driven)
| Script | Input | Output | Reference |
|---|---|---|---|
| | Images | 47 pixel features (color, texture, spatial, quality) | references/image-visual-features.md |
| | Videos | Frame-level + video-level aggregated features (scene-based extraction by default, FFmpeg fixed-interval optional) | references/video-visual-features.md |
| | Video/audio | 8 interpretable scores + raw openSMILE features | references/audio-features.md |
Shared utilities:
common.py — read_input(), save_excel(), derive_subject(), merge_checkpoints()
Command pattern:
python <script> --input <file> --base-dir <root> [--features ...] [--id-cols ...] [--subjects ...] [--preview] [--merge]
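To make the command pattern concrete, here is a minimal argparse sketch that accepts the same flags. The flag names come from the pattern above; everything else (which flags take several space-separated values, the defaults, the help text) is an assumption for illustration and may differ from the actual scripts, so check the reference files for the real behavior.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (semantics are assumed)."""
    parser = argparse.ArgumentParser(
        description="Sketch of the local-pipeline command pattern"
    )
    parser.add_argument("--input", required=True, help="input dataset file")
    parser.add_argument("--base-dir", required=True, help="project root")
    # Assumption: list-like flags take space-separated values.
    parser.add_argument("--features", nargs="*", default=None,
                        help="feature categories to compute (assumed: all if omitted)")
    parser.add_argument("--id-cols", nargs="*", default=[],
                        help="extra columns to carry through to the output")
    parser.add_argument("--subjects", nargs="*", default=None,
                        help="restrict the run to these subjects")
    parser.add_argument("--preview", action="store_true",
                        help="dry run: report subjects and counts only")
    parser.add_argument("--merge", action="store_true",
                        help="merge existing checkpoints")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--input", "data.csv", "--base-dir", "/proj", "--preview"]
    )
    print(args)
```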
Gemini Pipelines (config-driven)
Both pipelines read a pipeline_config.py file that defines paths, schema, metadata formatting, and validation rules. Copy scripts/gemini/pipeline_config.py to your project and customize.
Standard (gemini/standard/gemini_standard.py): inline media, 25 workers, auto-retry. See gemini-standard.md.
Batch (gemini/batch/[0-5]*.py + utils.py): 6-step pipeline, 50% discount. URIs expire after 48 hours. See gemini-batch-workflow.md.
python 0uploadMedia.py --config /path/to/config.py --submit --max-batch-gb 2 --key 1
python 3checkStatus.py --config /path/to/config.py --poll
python 4retryErrors.py --config /path/to/config.py --preview
# >500 failures: batch retry
python 4retryErrors.py --config /path/to/config.py --submit
python 3checkStatus.py --config /path/to/config.py --poll
python 4retryErrors.py --config /path/to/config.py --collect
# <=500 failures or after batch retries: live fallback
python 4retryErrors.py --config /path/to/config.py --standard
python 5review.py --config /path/to/config.py --merge
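The retry branching implied by the comments in the commands above can be written down as a small helper. This is a sketch only: the function name, its arguments, and the returned labels are hypothetical, and the single fact taken from this document is the 500-failure threshold. The full decision table lives in multi-key-management.md.

```python
BATCH_RETRY_THRESHOLD = 500  # threshold stated in the command comments


def choose_retry_mode(n_failures, batch_retry_done=False):
    """Pick a retry path for failed rows (hypothetical helper).

    "batch"    -> resubmit as a batch job (--submit, then --poll and --collect)
    "standard" -> live fallback (--standard)
    "none"     -> nothing to retry
    """
    if n_failures <= 0:
        return "none"
    if n_failures > BATCH_RETRY_THRESHOLD and not batch_retry_done:
        return "batch"
    return "standard"
```

For example, 501 failures on a first pass would map to a batch retry, while the same count after a batch retry has already run would fall through to the live fallback.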
Decision: >5 GB or >10 subjects and not time-sensitive → batch. Otherwise → standard. See gemini-standard.md.
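The decision rule can be expressed as a one-line predicate. Note the assumption about precedence: the sketch reads the rule as "(more than 5 GB or more than 10 subjects) and not time-sensitive", which seems the most natural reading but is not spelled out here. The function name and parameters are hypothetical.

```python
def choose_gemini_pipeline(total_gb, n_subjects, time_sensitive):
    """Return "batch" or "standard" under the assumed reading of the rule."""
    is_large = total_gb > 5 or n_subjects > 10
    return "batch" if is_large and not time_sensitive else "standard"
```

Under this reading, a 12 GB job that must finish today still goes to the standard pipeline, because the batch path trades latency for the discount.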
Multi-key: Each GOOGLE_API_KEY{N} = 20 GB quota. See multi-key-management.md.
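As a quick planning aid, the number of keys a job needs follows from the 20 GB figure by ceiling division. The sketch assumes the quota is per key and simply additive across keys; the function name is hypothetical, and multi-key-management.md remains the authority on how keys are actually rotated.

```python
import math

KEY_QUOTA_GB = 20  # per-key quota stated above


def keys_needed(total_gb):
    """Minimum number of GOOGLE_API_KEY{N} entries for total_gb of media."""
    if total_gb <= 0:
        return 0
    return math.ceil(total_gb / KEY_QUOTA_GB)
```

With GOOGLE_API_KEY1-4 configured, this arithmetic puts the ceiling at 80 GB of media under the same assumptions.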
Adapting for New Projects
Local pipelines (Pillow, openSMILE): No modification needed. All project-specific values come from CLI args.
Gemini pipelines: Config-driven, no script modification needed. Scripts are copied to the project during setup (the Materialize step), then:
- Adapt <BASE_DIR>/scripts/pipeline_config.py (already copied from the template during setup)
- Set BASE_DIR, INPUT_PATH, SYSTEM_PROMPT_PATH to your project paths (SYSTEM_PROMPT_PATH relative to BASE_DIR, e.g., scripts/<prompt>.txt)
- Set GROUP_COL, FILE_COL, ANALYSIS_FIELDS to match your input schema and system prompt
- Implement subject_id() and format_metadata() for your domain
- Optionally implement validate_row() for field-specific validation rules
- Run any script with --config <BASE_DIR>/scripts/pipeline_config.py
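For orientation, a customized pipeline_config.py might look something like the sketch below. The setting and function names are the ones listed above; every value, the function signatures (a dict-like row in, a string or list out), and the return conventions are assumptions made for illustration. The template in scripts/gemini/pipeline_config.py defines the real contract and should be treated as the source of truth.

```python
from pathlib import Path

# Paths (placeholder values; replace with your project's locations)
BASE_DIR = Path("/path/to/project")
INPUT_PATH = BASE_DIR / "data" / "input.xlsx"
SYSTEM_PROMPT_PATH = "scripts/system_prompt.txt"  # relative to BASE_DIR

# Schema (hypothetical column and field names)
GROUP_COL = "subject"
FILE_COL = "filename"
ANALYSIS_FIELDS = ["scene_type", "people_count"]


def subject_id(row):
    """Derive the subject identifier from a row (assumed: dict-like row)."""
    return str(row[GROUP_COL]).strip()


def format_metadata(row):
    """Build the per-item context sent along with the media (assumed format)."""
    return f"Subject: {subject_id(row)}\nFile: {row[FILE_COL]}"


def validate_row(result):
    """Return a list of problems for one parsed result (assumed convention)."""
    errors = []
    for field in ANALYSIS_FIELDS:
        if result.get(field) in (None, ""):
            errors.append(f"{field} is missing")
    return errors
```

Keeping all project-specific logic in these few names is what lets the pipeline scripts themselves stay unmodified across projects.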
Scope
Include: Image/video/audio feature extraction, Gemini visual semantic analysis, batch job management, checkpoint merging, multi-key quota management.
Exclude: ML model training, deep learning inference, real-time streaming analysis.
Checklist
- Read project CLAUDE.md for paths and column names
- Confirm --id-cols and --features with user
- --preview dry run confirms expected subjects and counts
- Extraction completed with 0 or acceptable failures
- For Gemini: .env with API keys, system prompt file created