Awesome-omni-skill pdf-smart-split
Intelligent PDF splitting skill. Split multi-page scanned PDF documents into separate files based on content analysis. Features: (1) Convert PDF to high-quality images (2) Analyze and intelligently name images by content (3) Auto-detect and remove duplicate pages (4) Identify and reorder scrambled pages (5) Merge related images into separate PDFs by content (6) Generate summary reports. Use for: bond filing documents, application materials, contracts and agreements that need intelligent content-based splitting.
git clone https://github.com/diegosouzapw/awesome-omni-skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/content-media/pdf-smart-split" ~/.claude/skills/diegosouzapw-awesome-omni-skill-pdf-smart-split && rm -rf "$T"
skills/content-media/pdf-smart-split/SKILL.mdPDF Smart Split Skill
Split multi-page scanned PDF documents into separate files based on content analysis. Supports duplicate detection, page reordering, and intelligent naming.
Workflow
1. PDF to Images
from pdf2image import convert_from_path images = convert_from_path(pdf_path, dpi=200, thread_count=4)
- DPI=200 for quality
- Multi-threading for speed
2. Duplicate Detection
import hashlib def get_image_hash(path): with open(path, 'rb') as f: return hashlib.md5(f.read()).hexdigest()
- Use MD5 hash comparison
- Record duplicates for summary
3. Content Analysis and Naming
Use
view_file tool to analyze each image:
- File type (cover, content, signature page)
- Document category (application, agreement, statement)
- Page sequence number
Naming format:
sequence_filetype_description.jpg
4. Page Reordering
Analyze content continuity:
- Check table sequence numbers
- Check chapter numbering progression
- Ensure signature pages follow content
Reorder by logical content, not physical order.
5. Merge Images to PDF
from PIL import Image def images_to_pdf(image_paths, output_pdf): images = [Image.open(p).convert('RGB') for p in image_paths] images[0].save(output_pdf, "PDF", save_all=True, append_images=images[1:])
- Merge signature pages with their content
- Combine continuous content into single files
6. Generate Summary
Summary should include:
- Source file info (name, original pages)
- Duplicate removal records (if any)
- Page reorder records (if any)
- Split file list (name, original pages, page count, description)
- Output directory paths
Script Usage
Full Processing
python scripts/convert_dedup.py <pdf_path> [output_dir] python scripts/merge_images.py <image1> <image2> ... <output.pdf>
Output Structure
filename_images/ # Intelligently named images filename_split_PDF/ # Content-grouped PDFs
Requirements
- Install:
,pdf2imagePillow - Windows needs
:popplerconda install -c conda-forge poppler