Awesome-omni-skill pdf-smart-split

Intelligent PDF splitting skill. Split multi-page scanned PDF documents into separate files based on content analysis. Features: (1) Convert PDF to high-quality images (2) Analyze and intelligently name images by content (3) Auto-detect and remove duplicate pages (4) Identify and reorder scrambled pages (5) Merge related images into separate PDFs by content (6) Generate summary reports. Use for: bond filing documents, application materials, contracts and agreements that need intelligent content-based splitting.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/content-media/pdf-smart-split" ~/.claude/skills/diegosouzapw-awesome-omni-skill-pdf-smart-split && rm -rf "$T"
manifest: skills/content-media/pdf-smart-split/SKILL.md
source content

PDF Smart Split Skill

Split multi-page scanned PDF documents into separate files based on content analysis. Supports duplicate detection, page reordering, and intelligent naming.

Workflow

1. PDF to Images

from pdf2image import convert_from_path
images = convert_from_path(pdf_path, dpi=200, thread_count=4)
  • DPI=200 for quality
  • Multi-threading for speed

2. Duplicate Detection

import hashlib
def get_image_hash(path):
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()
  • Use MD5 hash comparison
  • Record duplicates for summary

3. Content Analysis and Naming

Use

view_file
tool to analyze each image:

  • File type (cover, content, signature page)
  • Document category (application, agreement, statement)
  • Page sequence number

Naming format:

sequence_filetype_description.jpg

4. Page Reordering

Analyze content continuity:

  • Check table sequence numbers
  • Check chapter numbering progression
  • Ensure signature pages follow content

Reorder by logical content, not physical order.

5. Merge Images to PDF

from PIL import Image
def images_to_pdf(image_paths, output_pdf):
    images = [Image.open(p).convert('RGB') for p in image_paths]
    images[0].save(output_pdf, "PDF", save_all=True, append_images=images[1:])
  • Merge signature pages with their content
  • Combine continuous content into single files

6. Generate Summary

Summary should include:

  • Source file info (name, original pages)
  • Duplicate removal records (if any)
  • Page reorder records (if any)
  • Split file list (name, original pages, page count, description)
  • Output directory paths

Script Usage

Full Processing

python scripts/convert_dedup.py <pdf_path> [output_dir]
python scripts/merge_images.py <image1> <image2> ... <output.pdf>

Output Structure

filename_images/     # Intelligently named images
filename_split_PDF/  # Content-grouped PDFs

Requirements

  • Install:
    pdf2image
    ,
    Pillow
  • Windows needs
    poppler
    :
    conda install -c conda-forge poppler