AutoSkill Python Image Dataset Loader and Caption Filter
A Python module to load images and associated caption files from a directory, filter images based on caption text patterns with wildcards and exclusion rules, and copy the matched files to a new location.
install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/python-image-dataset-loader-and-caption-filter" ~/.claude/skills/ecnu-icalk-autoskill-python-image-dataset-loader-and-caption-filter && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8/python-image-dataset-loader-and-caption-filter/SKILL.md
source content
Python Image Dataset Loader and Caption Filter
Prompt
Role & Objective
You are a Python developer tasked with creating a dataset management module. The goal is to load images and their associated captions from a file system, filter the images based on specific caption text matching rules, and copy the results to a new directory.
Communication & Style Preferences
- Provide complete, executable Python code.
- Use standard libraries (os, shutil, re) and Pillow (PIL) for image handling.
- Ensure the code is robust: check that files and directories exist before reading or copying them.
Operational Rules & Constraints
- Data Structures:
  - Define a `Caption` class with a `caption` attribute (string).
  - Define an `Image` class with attributes: `image_file` (string), `width` (int), `height` (int), and `captions` (List[Caption]).
- Image Loading (`load_path`):
  - Accept a directory path.
  - Iterate through files to find images (support common extensions like .png, .jpg, .jpeg, .webp).
  - Use Pillow to open images and extract `width` and `height`.
  - For each image, check for caption files with the same base name but extensions `.txt` or `.caption`. Load the text content into `Caption` objects.
  - Return a list of `Image` objects.
- Caption Search Logic:
  - Use two separate lists for filtering: `include_patterns` and `exclude_patterns`. Do NOT use a prefix (like '-') to denote exclusion; the list separation handles that.
  - Implement `regex_from_pattern(pattern)` to convert user search strings into valid regex strings:
    - Escape special regex characters.
    - Treat `*` as a wildcard matching any sequence of characters (equivalent to `.*` in regex).
    - If the pattern does not start with `*`, prepend a word boundary (`\b`).
    - If the pattern does not end with `*`, append a word boundary (`\b`).
    - Handle spaces within patterns to allow phrase matching (e.g., "comic book character").
  - Implement `match_caption(caption, include_patterns, exclude_patterns)`:
    - Perform case-insensitive matching.
    - If the caption matches any pattern in `exclude_patterns`, return `False` immediately.
    - If `include_patterns` is not empty, the caption must match at least one pattern in `include_patterns` to return `True`.
    - If `include_patterns` is empty and no exclude patterns matched, return `True`.
- File Copying:
  - Implement a function to copy matched `Image` objects and their associated caption files to a specified destination directory.
  - Create the destination directory if it does not exist.
  - Maintain original filenames.
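The data structures, loading, and copying rules above could be sketched roughly as follows. This is one possible reading of the rules, not a reference implementation: the names `copy_images`, `caption_files_for`, `IMAGE_EXTENSIONS`, and `CAPTION_EXTENSIONS` are hypothetical, and the sketch assumes captions are UTF-8 text, that surrounding whitespace in a caption can be stripped, and that only the top level of the directory is scanned.

```python
import os
import shutil
from dataclasses import dataclass, field
from typing import List

# Assumed extension sets; the rules only say "common extensions like" these.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}
CAPTION_EXTENSIONS = (".txt", ".caption")


@dataclass
class Caption:
    caption: str


@dataclass
class Image:
    image_file: str
    width: int
    height: int
    captions: List[Caption] = field(default_factory=list)


def caption_files_for(image_file: str) -> List[str]:
    """Hypothetical helper: existing caption files sharing the image's base name."""
    base, _ = os.path.splitext(image_file)
    return [base + ext for ext in CAPTION_EXTENSIONS if os.path.isfile(base + ext)]


def load_path(path: str) -> List[Image]:
    """Load every image directly inside `path`, together with its captions."""
    # Pillow is imported here so the rest of the module works without it installed.
    from PIL import Image as PILImage

    images = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue
        if os.path.splitext(name)[1].lower() not in IMAGE_EXTENSIONS:
            continue
        with PILImage.open(full) as img:
            width, height = img.size
        captions = []
        for cap_file in caption_files_for(full):
            # Assumption: caption files are UTF-8 text.
            with open(cap_file, encoding="utf-8") as fh:
                captions.append(Caption(fh.read().strip()))
        images.append(Image(full, width, height, captions))
    return images


def copy_images(images: List[Image], destination: str) -> None:
    """Hypothetical name: copy images and their caption files, keeping filenames."""
    os.makedirs(destination, exist_ok=True)
    for image in images:
        for src in [image.image_file] + caption_files_for(image.image_file):
            shutil.copy2(src, os.path.join(destination, os.path.basename(src)))
```

Because the original filenames are kept, two source files with the same name from different directories would overwrite each other in the destination; the rules do not say how to handle that case, so the sketch leaves it open.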
Anti-Patterns
- Do not use a `-` prefix for exclusion patterns.
- Do not match substrings unless wildcards are explicitly used (respect word boundaries).
- Do not assume case sensitivity; matching should be case-insensitive.
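A minimal sketch of how the caption search rules and the anti-patterns above might fit together is shown below. It assumes `caption` is passed as a plain string (the rules do not say whether a string or a `Caption` object is expected) and that any run of whitespace inside a pattern should match any run of whitespace in the caption; both are interpretations rather than requirements stated in the rules.

```python
import re
from typing import List


def regex_from_pattern(pattern: str) -> str:
    """Convert a user search string with `*` wildcards into a regex string."""
    pattern = pattern.strip()
    pieces = []
    # Split into literal text, "*" wildcards, and whitespace runs, keeping the separators.
    for token in re.split(r"(\*|\s+)", pattern):
        if token == "*":
            pieces.append(".*")            # wildcard: any sequence of characters
        elif token.isspace():
            pieces.append(r"\s+")          # assumption: flexible whitespace in phrases
        else:
            pieces.append(re.escape(token))  # literal text, special characters escaped
    regex = "".join(pieces)
    if not pattern.startswith("*"):
        regex = r"\b" + regex
    if not pattern.endswith("*"):
        regex = regex + r"\b"
    return regex


def match_caption(caption: str, include_patterns: List[str],
                  exclude_patterns: List[str]) -> bool:
    """Apply exclusion first, then inclusion, ignoring case."""
    def matches(pattern: str) -> bool:
        return re.search(regex_from_pattern(pattern), caption, re.IGNORECASE) is not None

    if any(matches(p) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(matches(p) for p in include_patterns)
    return True


print(match_caption("A black cat on a sofa", ["cat"], []))        # True
print(match_caption("A category of things", ["cat"], []))         # False (word boundary)
print(match_caption("A category of things", ["cat*"], []))        # True (explicit wildcard)
print(match_caption("A cat and a dog", ["cat"], ["dog"]))         # False (excluded)
```

Note that `\b` only behaves as expected when the pattern starts and ends with word characters; a pattern such as `c++` would need extra handling that the rules do not describe.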
Triggers
- load images and captions from path
- filter images by caption text patterns
- search captions with wildcards and exclude lists
- copy matched images and captions to new folder
- python dataset image loader