Marketplace pdf
Comprehensive PDF manipulation, extraction, and generation with support for text extraction, form filling, merging, splitting, annotations, and creation. Use when working with .pdf files for: (1) Extracting text and tables, (2) Filling PDF forms, (3) Merging/splitting PDFs, (4) Creating PDFs programmatically, (5) Adding watermarks/annotations, (6) PDF metadata management
git clone https://github.com/aiskillstore/marketplace
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiskillstore/marketplace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/autumnsgrove/pdf" ~/.claude/skills/aiskillstore-marketplace-pdf-8243e1 && rm -rf "$T"
skills/autumnsgrove/pdf/SKILL.md
PDF Manipulation Skill
Comprehensive guide for working with PDF files in Python, covering extraction, manipulation, creation, and advanced operations using progressive disclosure for efficiency.
Core Capabilities
Extract and manipulate PDF content:
- Extract text with layout preservation
- Extract tables and parse structured data
- Fill PDF forms programmatically
- Merge multiple PDFs into a single document
- Split PDFs by pages or ranges
- Create PDFs from scratch with text, images, and graphics
- Add watermarks and annotations
- Extract and modify metadata (author, title, keywords)
- Add password protection and encryption
- Perform OCR on scanned documents
- Convert images to PDF
- Compress and optimize PDF files
- Extract images from PDFs
- Rotate and reorder pages
Quick Start
Install required libraries:
pip install pypdf pdfplumber reportlab PyMuPDF pdf2image pytesseract pillow
For detailed installation instructions including system dependencies, see the Library Installation reference under Additional Resources.
Python Libraries Overview
- pypdf: Basic operations (merge, split, rotate, metadata)
- pdfplumber: Advanced text/table extraction with layout awareness
- reportlab: Create PDFs from scratch (reports, invoices, documents)
- PyMuPDF (fitz): Advanced manipulation, annotations, compression
- pdf2image: Convert PDF pages to images (requires poppler)
- pytesseract: OCR for scanned documents (requires tesseract)
Text Extraction Workflow
Basic Extraction
    from pypdf import PdfReader

    reader = PdfReader("document.pdf")
    for page in reader.pages:
        text = page.extract_text()
        print(text)
Layout-Aware Extraction
    import pdfplumber

    with pdfplumber.open("document.pdf") as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            words = page.extract_words()  # With positioning
            print(text)
Extract from Specific Region
    import pdfplumber

    with pdfplumber.open("document.pdf") as pdf:
        page = pdf.pages[0]
        bbox = (0, 0, 612, 100)  # x0, top, x1, bottom (points, measured from the top-left)
        header = page.crop(bbox).extract_text()
For detailed text extraction methods including OCR fallback and encoding handling, see the Text Extraction reference under Additional Resources.
Table Extraction Workflow
Extract All Tables
    import pdfplumber

    with pdfplumber.open("report.pdf") as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                print(table)
Advanced Table Detection
    # `page` is a pdfplumber page, as in the loop above
    table_settings = {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "snap_tolerance": 3,
    }
    tables = page.extract_tables(table_settings=table_settings)
For detailed table extraction strategies and data cleaning, see the Table Extraction reference under Additional Resources.
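Each table returned by page.extract_tables() is a list of rows, and cells can be None or contain wrapped text. The following is a minimal cleaning sketch, not part of the skill itself: the helper name table_to_records and the column_N fallback naming are illustrative assumptions, and it assumes the first non-empty row is the header.

```python
def table_to_records(table):
    """Turn one extract_tables() result (rows of str-or-None cells) into dicts."""
    # Drop rows that are entirely empty (None or whitespace-only cells).
    rows = [row for row in table if any((cell or "").strip() for cell in row)]
    if not rows:
        return []

    # Assumption: the first remaining row is the header; blank headers get a name.
    header = [
        (cell or "").strip() or f"column_{i}"
        for i, cell in enumerate(rows[0])
    ]

    records = []
    for row in rows[1:]:
        # Collapse newlines and repeated whitespace left behind by wrapped cells.
        cells = [" ".join((cell or "").split()) for cell in row]
        records.append(dict(zip(header, cells)))
    return records


sample = [
    ["Product", "Quantity", None],
    [None, None, None],
    ["Widget A", "10", "$50"],
    ["Widget\nB", " 5 ", None],
]
print(table_to_records(sample))
```

The resulting list of dicts can be handed to csv.DictWriter or a DataFrame constructor if further processing is needed.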
PDF Form Operations
Fill Form Fields
    import fitz  # PyMuPDF

    doc = fitz.open("form.pdf")
    for page in doc:
        for widget in page.widgets():
            if widget.field_name == "name":
                widget.field_value = "John Doe"
                widget.update()
    doc.save("filled.pdf")
    doc.close()
Extract Form Field Names
    import fitz  # PyMuPDF

    doc = fitz.open("form.pdf")
    for page in doc:
        for widget in page.widgets():
            print(f"{widget.field_name}: {widget.field_type_string}")
    doc.close()
For form filling, flattening, and debugging, see the PDF Operations reference under Additional Resources.
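Filling one field at a time becomes repetitive for longer forms. As a rough sketch, the loop above can be generalized to take a dict of values and report which names were never found. The helper name fill_fields is hypothetical, and it relies only on the widget attributes already used above (field_name, field_value, update()).

```python
def fill_fields(doc, values):
    """Set every widget whose field_name is a key of `values`.

    `doc` is any iterable of pages exposing widgets(), such as an open
    PyMuPDF document. Returns the requested names that were not found.
    """
    remaining = set(values)
    for page in doc:
        for widget in page.widgets():
            name = widget.field_name
            if name in values:
                widget.field_value = values[name]
                widget.update()  # write the new value back to the widget
                remaining.discard(name)
    return remaining


# Usage with PyMuPDF (requires the library and a real form file):
#   import fitz
#   doc = fitz.open("form.pdf")
#   missing = fill_fields(doc, {"name": "John Doe", "email": "john@example.com"})
#   if missing:
#       print("Fields not found:", sorted(missing))
#   doc.save("filled.pdf")
#   doc.close()
```

Returning the unmatched names makes misspelled or renamed fields visible instead of silently leaving them blank.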
Merging and Splitting
Merge PDFs
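    from pypdf import PdfWriter

    # PdfMerger was deprecated in pypdf 3.x and removed in 5.0;
    # PdfWriter.append() covers the same use case.
    merger = PdfWriter()
    for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
        merger.append(pdf)
    merger.write("merged.pdf")
    merger.close()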
Merge with Page Ranges
    from pypdf import PdfWriter

    merger = PdfWriter()  # replaces PdfMerger, which current pypdf no longer ships
    merger.append("doc1.pdf", pages=(0, 3))  # First 3 pages
    merger.append("doc2.pdf")  # All pages
    merger.write("compiled.pdf")
    merger.close()
Split into Individual Pages
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("document.pdf")
    for i, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        with open(f"page_{i+1}.pdf", 'wb') as f:
            writer.write(f)
For merging with bookmarks and splitting by size, see the PDF Operations reference under Additional Resources.
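Splitting by ranges, listed under Core Capabilities, can be sketched along the same lines. The range syntax ("1-3,5", 1-based and inclusive) and the helper names parse_page_ranges and split_by_ranges are assumptions made for illustration. Only the pypdf calls already shown above are used, and pypdf is imported lazily so the parser can be used on its own.

```python
def parse_page_ranges(spec, total_pages):
    """Convert a 1-based spec such as "1-3,5" into sorted zero-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"range {part!r} is outside 1..{total_pages}")
        indices.update(range(start - 1, end))
    return sorted(indices)


def split_by_ranges(src, dst, spec):
    """Write the pages selected by `spec` from `src` into a new file `dst`."""
    from pypdf import PdfReader, PdfWriter  # lazy import; requires pypdf

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as f:
        writer.write(f)


print(parse_page_ranges("1-3, 5", total_pages=10))  # [0, 1, 2, 4]
```

A call such as split_by_ranges("document.pdf", "excerpt.pdf", "1-3,5") would then produce a four-page file.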
Creating PDFs
Simple Text PDF
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import letter

    c = canvas.Canvas("output.pdf", pagesize=letter)
    c.setFont("Helvetica", 12)
    c.drawString(50, 750, "Hello, World!")
    c.save()
Styled Report
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
    from reportlab.lib.styles import getSampleStyleSheet

    doc = SimpleDocTemplate("report.pdf")
    styles = getSampleStyleSheet()
    story = []
    story.append(Paragraph("Report Title", styles['Title']))
    story.append(Spacer(1, 12))
    story.append(Paragraph("Content here", styles['BodyText']))
    doc.build(story)
PDF with Table
    from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
    from reportlab.lib import colors

    data = [
        ['Product', 'Quantity', 'Price'],
        ['Widget A', '10', '$50'],
        ['Widget B', '5', '$75']
    ]
    table = Table(data)
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('GRID', (0, 0), (-1, -1), 1, colors.black)
    ]))
    # A Table is a flowable: it only appears once a document is built with it
    SimpleDocTemplate("table.pdf").build([table])
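For complete PDF creation workflows including images, multi-column layouts, and custom fonts, see the PDF Creation reference under Additional Resources.
For practical examples, see the Invoice Generator and Report Automation examples under Additional Resources.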
Metadata and Security
Extract Metadata
    from pypdf import PdfReader

    reader = PdfReader("document.pdf")
    metadata = reader.metadata  # None when the file has no document info
    if metadata:
        print(f"Title: {metadata.get('/Title')}")
        print(f"Author: {metadata.get('/Author')}")
Modify Metadata
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("document.pdf")
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.add_metadata({
        '/Title': 'New Title',
        '/Author': 'John Doe'
    })
    with open("updated.pdf", 'wb') as f:
        writer.write(f)
Add Password Protection
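    # `writer` is a PdfWriter that already holds the pages (see Modify Metadata).
    # AES encryption needs pypdf's optional crypto dependency (pip install "pypdf[crypto]").
    writer.encrypt(
        user_password="user123",
        owner_password="owner456",
        algorithm="AES-256"
    )
    with open("protected.pdf", 'wb') as f:
        writer.write(f)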
For detailed security operations and comprehensive metadata management, see the Metadata, Security, OCR reference under Additional Resources.
OCR for Scanned Documents
Basic OCR
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path("scanned.pdf")
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image)
        print(f"Page {i+1}:\n{text}")
Multi-Language OCR
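    # `image` is one page image from convert_from_path(); each language in the
    # list needs its tesseract language data installed
    text = pytesseract.image_to_string(image, lang='eng+fra+deu')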
For searchable PDF creation and OCR preprocessing, see the Metadata, Security, OCR reference under Additional Resources.
Watermarks and Annotations
Add Text Watermark
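    import fitz  # PyMuPDF

    doc = fitz.open("document.pdf")
    for page in doc:
        r = page.rect
        center = fitz.Point((r.x0 + r.x1) / 2, (r.y0 + r.y1) / 2)
        page.insert_text(
            center,
            "CONFIDENTIAL",
            fontsize=50,
            color=(0.7, 0.7, 0.7),
            fill_opacity=0.3,  # PyMuPDF uses fill_opacity; there is no `opacity` argument
            # `rotate` only accepts multiples of 90, so the 45-degree tilt
            # is applied with a morph matrix around the page center instead
            morph=(center, fitz.Matrix(45)),
        )
    doc.save("watermarked.pdf")
    doc.close()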
Add Annotations
    # `page` is a PyMuPDF page; `rect` is a fitz.Rect and `point` a fitz.Point
    page.add_highlight_annot(rect)  # Highlight
    page.add_text_annot(point, "Note")  # Text note
    page.add_underline_annot(rect)  # Underline
For stamps and image watermarks, see the references listed under Additional Resources.
Page Operations
Rotate Pages
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("document.pdf")
    writer = PdfWriter()
    for page in reader.pages:
        page.rotate(90)
        writer.add_page(page)
    with open("rotated.pdf", 'wb') as f:
        writer.write(f)
Extract Images
    import fitz  # PyMuPDF

    doc = fitz.open("document.pdf")
    for page_num in range(len(doc)):
        page = doc[page_num]
        for img_index, img in enumerate(page.get_images()):
            xref = img[0]
            base_image = doc.extract_image(xref)
            # Embedded images are not always PNG; use the reported extension
            ext = base_image["ext"]
            with open(f"image_{page_num}_{img_index}.{ext}", "wb") as f:
                f.write(base_image["image"])
    doc.close()
Convert Images to PDF
    from PIL import Image
    from reportlab.pdfgen import canvas

    c = canvas.Canvas("output.pdf")
    for img_path in ["img1.jpg", "img2.jpg"]:
        img = Image.open(img_path)
        c.setPageSize(img.size)
        c.drawImage(img_path, 0, 0, width=img.width, height=img.height)
        c.showPage()
    c.save()
For detailed page operations, see the PDF Operations reference under Additional Resources.
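Reordering pages, listed under Core Capabilities, is not shown above. One way to approach it is to compute the new page sequence first and then add the pages to a writer in that order. The helper name reorder is illustrative only, and the pypdf usage is left in comments because it needs a real file.

```python
def reorder(pages, order):
    """Return `pages` rearranged by zero-based `order`.

    Indices may repeat (duplicating a page) or be omitted (dropping it).
    """
    count = len(pages)
    for index in order:
        if not 0 <= index < count:
            raise IndexError(f"page index {index} is outside 0..{count - 1}")
    return [pages[index] for index in order]


print(reorder(["cover", "intro", "appendix", "body"], [0, 1, 3, 2]))
# ['cover', 'intro', 'body', 'appendix']

# With pypdf (requires the library and an input file):
#   reader = PdfReader("document.pdf")
#   writer = PdfWriter()
#   for page in reorder(list(reader.pages), [2, 0, 1]):
#       writer.add_page(page)
#   with open("reordered.pdf", "wb") as f:
#       writer.write(f)
```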
Optimization
Compress PDF
    import fitz  # PyMuPDF

    doc = fitz.open("large.pdf")
    doc.save(
        "optimized.pdf",
        garbage=4,
        deflate=True,
        clean=True
    )
    doc.close()
Best Practices
Memory Management
Process large PDFs in chunks:
    from pypdf import PdfReader
    import gc

    reader = PdfReader("large.pdf")
    for i, page in enumerate(reader.pages):
        text = page.extract_text()
        # Process text
        if i % 10 == 0:
            gc.collect()
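If the per-page work is better done in explicit batches, for example to flush results to disk every few pages, a small generator can group the pages. This is a generic sketch rather than part of the skill: the name chunked and the batch size are assumptions.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]

# With pypdf (requires the library and an input file):
#   for batch in chunked(reader.pages, 10):
#       texts = [page.extract_text() for page in batch]
#       ...  # write `texts` out, then let the batch go out of scope
```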
Error Handling
Always handle encryption and errors:
    from pypdf import PdfReader

    password = "secret"  # placeholder; supply the real password

    try:
        reader = PdfReader("document.pdf")
        if reader.is_encrypted:
            reader.decrypt(password)
        for page in reader.pages:
            text = page.extract_text()
    except Exception as e:
        print(f"Error: {e}")
OCR Fallback
Detect and handle scanned documents:
    import fitz  # PyMuPDF

    doc = fitz.open("document.pdf")
    text = doc[0].get_text()
    if not text.strip():
        # Use OCR for scanned document
        from pdf2image import convert_from_path
        import pytesseract

        images = convert_from_path("document.pdf")
        text = pytesseract.image_to_string(images[0])
For comprehensive best practices, common pitfalls, and troubleshooting, see the Best Practices reference under Additional Resources.
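The fallback above checks only the first page and treats any non-empty text as usable. A per-page variant might look like the sketch below. The 20-character threshold, the function names, and the idea of passing OCR in as a callable are all assumptions chosen for illustration, so that the decision logic stays independent of any particular OCR library.

```python
def needs_ocr(text, min_chars=20):
    """Guess that a page is scanned when almost no visible text was extracted."""
    visible = "".join(text.split())  # strip all whitespace before counting
    return len(visible) < min_chars


def extract_pages(native_texts, ocr_page):
    """Keep native text where it looks usable; otherwise call ocr_page(index)."""
    results = []
    for index, text in enumerate(native_texts):
        results.append(ocr_page(index) if needs_ocr(text) else text)
    return results


native = ["This page has a real embedded text layer.", "  \n ", ""]
print(extract_pages(native, ocr_page=lambda i: f"<ocr page {i}>"))
```

In practice ocr_page could wrap convert_from_path and pytesseract.image_to_string for the given page number, and the threshold would need tuning against real documents.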
Common Pitfalls
- Scanned Documents: Text extraction returns an empty string for scanned PDFs. Fall back to OCR (pdf2image + pytesseract).
- Table Detection: Tables are missed or split incorrectly. Adjust the table_settings strategies and tolerances.
- Encrypted PDFs: Operations fail on encrypted files. Check is_encrypted and decrypt with the password first.
- Form Fields: A field name cannot be found. List every widget's field_name first (see Extract Form Field Names).
- Memory Issues: Large PDFs cause crashes. Process pages in chunks and trigger garbage collection.
- Encoding Issues: Special characters come out corrupted. Read and write extracted text with an explicit UTF-8 encoding.
For detailed solutions and debugging strategies, see the Best Practices reference under Additional Resources.
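For the encoding pitfall, a common approach is to normalize the extracted text and always pass an explicit encoding when writing it out. The sketch below uses only the standard library. The helper names are hypothetical, and NFKC normalization is one possible choice rather than something the skill prescribes: it also folds other compatibility characters, such as superscripts and non-breaking spaces, which may not be wanted everywhere.

```python
import unicodedata


def normalize_text(text):
    """Fold compatibility characters (e.g. the 'fi' ligature) into plain ones."""
    return unicodedata.normalize("NFKC", text)


def save_text(text, path):
    """Write normalized text with an explicit UTF-8 encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(normalize_text(text))


# "\ufb01" is the single-character "fi" ligature often found in extracted text.
print(normalize_text("\ufb01nancial caf\u00e9"))  # financial café
```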
Quick Reference
Text Extraction:
- Simple: pypdf - page.extract_text()
- Advanced: pdfplumber - page.extract_text() + page.extract_words()
Table Extraction:
- Always use: pdfplumber - page.extract_tables()
PDF Creation:
- Use: reportlab - canvas.Canvas() or SimpleDocTemplate()
Advanced Operations:
- Use: PyMuPDF (fitz) - forms, annotations, compression
OCR:
- Use: pytesseract + pdf2image
Merging/Splitting:
- Use: pypdf - PdfWriter() (PdfMerger() exists only in pypdf releases before 5.0)
Helper Scripts
The skill includes helper scripts for common operations:
    # See the scripts directory for utilities
    python scripts/pdf_helper.py --help
Additional Resources
Comprehensive References:
- Library Installation - Setup and dependencies
- Text Extraction - All extraction methods
- Table Extraction - Table detection strategies
- PDF Operations - Forms, merge, split, pages
- PDF Creation - Creating PDFs from scratch
- Metadata, Security, OCR - Advanced operations
- Best Practices - Pitfalls and solutions
Practical Examples:
- Invoice Generator - Professional invoice templates
- Report Automation - Automated report generation
Implementation Guidelines
When working with PDFs:
- Choose the right library for your task (see Quick Reference)
- Handle errors with try-except blocks
- Check for encryption before processing
- Use OCR fallback for scanned documents
- Process large files in chunks to manage memory
- Validate input files before operations
- Close documents to free resources: doc.close()
For production use, always implement proper error handling, validate inputs, and test with various PDF types and versions.
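As one example of validating inputs before handing a file to any of the libraries, a cheap pre-check can confirm that the path points at a non-empty file carrying the %PDF- marker near its start. The function name looks_like_pdf and the 1024-byte window are assumptions for this sketch. It does not prove the file is well-formed, so parsing errors still need to be handled.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Cheap pre-check: an existing, non-empty file with a %PDF- marker up front."""
    file = Path(path)
    if not file.is_file() or file.stat().st_size == 0:
        return False
    with file.open("rb") as f:
        head = f.read(1024)  # the marker is expected near the start of the file
    return b"%PDF-" in head


# Example:
#   if not looks_like_pdf("document.pdf"):
#       raise ValueError("document.pdf does not look like a PDF")
```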