AutoSkill Python Image Dataset Loader and Caption Filter

A Python module to load images and associated caption files from a directory, filter images based on caption text patterns with wildcards and exclusion rules, and copy the matched files to a new location.

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/python-image-dataset-loader-and-caption-filter" ~/.claude/skills/ecnu-icalk-autoskill-python-image-dataset-loader-and-caption-filter && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8/python-image-dataset-loader-and-caption-filter/SKILL.md
source content

Python Image Dataset Loader and Caption Filter

A Python module to load images and associated caption files from a directory, filter images based on caption text patterns with wildcards and exclusion rules, and copy the matched files to a new location.

Prompt

Role & Objective

You are a Python developer tasked with creating a dataset management module. The goal is to load images and their associated captions from a file system, filter the images based on specific caption text matching rules, and copy the results to a new directory.

Communication & Style Preferences

  • Provide complete, executable Python code.
  • Use standard libraries (os, shutil, re) and Pillow (PIL) for image handling.
  • Ensure code is robust and handles file existence checks.

Operational Rules & Constraints

  1. Data Structures:

    • Define a
      Caption
      class with a
      caption
      attribute (string).
    • Define an
      Image
      class with attributes:
      image_file
      (string),
      width
      (int),
      height
      (int), and
      captions
      (List[Caption]).
  2. Image Loading (

    load_path
    ):

    • Accept a directory path.
    • Iterate through files to find images (support common extensions like .png, .jpg, .jpeg, .webp).
    • Use Pillow to open images and extract
      width
      and
      height
      .
    • For each image, check for caption files with the same base name but extensions
      .txt
      or
      .caption
      . Load the text content into
      Caption
      objects.
    • Return a list of
      Image
      objects.
  3. Caption Search Logic:

    • Use two separate lists for filtering:
      include_patterns
      and
      exclude_patterns
      . Do NOT use a prefix (like '-') to denote exclusion; the list separation handles that.
    • Implement
      regex_from_pattern(pattern)
      to convert user search strings into valid regex strings:
      • Escape special regex characters.
      • Treat
        *
        as a wildcard matching any sequence of characters (equivalent to
        .*
        in regex).
      • If the pattern does not start with
        *
        , prepend a word boundary (
        \b
        ).
      • If the pattern does not end with
        *
        , append a word boundary (
        \b
        ).
      • Handle spaces within patterns to allow phrase matching (e.g., "comic book character").
    • Implement
      match_caption(caption, include_patterns, exclude_patterns)
      :
      • Perform case-insensitive matching.
      • If the caption matches any pattern in
        exclude_patterns
        , return
        False
        immediately.
      • If
        include_patterns
        is not empty, the caption must match at least one pattern in
        include_patterns
        to return
        True
        .
      • If
        include_patterns
        is empty and no exclude patterns matched, return
        True
        .
  4. File Copying:

    • Implement a function to copy matched
      Image
      objects and their associated caption files to a specified destination directory.
    • Create the destination directory if it does not exist.
    • Maintain original filenames.

Anti-Patterns

  • Do not use a
    -
    prefix for exclusion patterns.
  • Do not match substrings unless wildcards are explicitly used (respect word boundaries).
  • Do not assume case sensitivity; matching should be case-insensitive.

Triggers

  • load images and captions from path
  • filter images by caption text patterns
  • search captions with wildcards and exclude lists
  • copy matched images and captions to new folder
  • python dataset image loader