Skills pdf
install
source · Clone the upstream repo
git clone https://github.com/wulaosiji/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/wulaosiji/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/pdf" ~/.claude/skills/wulaosiji-skills-pdf && rm -rf "$T"
manifest: pdf/SKILL.md
PDF Processing Guide
Comprehensive PDF processing operations using Python libraries and command-line tools.
When to Use
Use This Skill When
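- Extracting text or table data from a PDF
- Merging multiple PDF files into one
- Splitting a PDF into single-page files
- Rotating PDF pages
- Adding a watermark to a PDF
- Creating new PDF documents
- Filling in PDF forms
- Encrypting or decrypting a PDF
- Running OCR on scanned PDFs
- Extracting images from a PDF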
Do NOT Use This Skill If
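- The PDF is password-protected and you do not have the password
- You need complex PDF editing (such as modifying existing content)
- The PDF file is corrupted and cannot be read
- You need to preserve the original PDF's complex layout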
Typical Trigger Phrases
Chinese:
- "提取PDF文字"
- "合并PDF文件"
- "拆分PDF"
- "PDF加水印"
- "PDF转Word"
- "扫描PDF识别"
English:
- "Extract PDF text"
- "Merge PDF files"
- "Split PDF"
- "Add watermark to PDF"
- "Convert PDF to Word"
- "OCR scanned PDF"
Workflow
Step 1: Identify the PDF operation type
| Operation | Recommended tool | Complexity |
|---|---|---|
| Text extraction | pdfplumber | Simple |
| Table extraction | pdfplumber | Medium |
| Merge/split | pypdf / qpdf | Simple |
| Create PDF | reportlab | Medium |
| OCR | pytesseract | Complex |
Step 2: Choose the right tool
- pypdf: basic operations (merge, split, metadata)
- pdfplumber: text and table extraction
- reportlab: creating PDFs
- qpdf: advanced command-line operations
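The tool choices in Steps 1 and 2 can be written down as a small lookup. This is only an illustrative sketch: the operation keys and the `recommend_tool` helper are hypothetical names chosen for this example, not part of the skill or of any library.

```python
# Hypothetical lookup that mirrors the operation/tool table in Step 1.
# For merge/split, qpdf is the command-line alternative to pypdf.
RECOMMENDED_TOOLS = {
    "extract_text": "pdfplumber",
    "extract_tables": "pdfplumber",
    "merge": "pypdf",
    "split": "pypdf",
    "create": "reportlab",
    "ocr": "pytesseract",
}


def recommend_tool(operation: str) -> str:
    """Return the recommended library for an operation key, or raise ValueError."""
    try:
        return RECOMMENDED_TOOLS[operation]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED_TOOLS))
        raise ValueError(
            f"Unknown operation {operation!r}; expected one of: {known}"
        ) from None


print(recommend_tool("extract_tables"))
```

Raising on an unknown key, rather than returning a default, makes an unsupported request visible early.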
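Step 3: Run the operation
from pypdf import PdfReader, PdfWriter  # or: import pdfplumber
Step 4: Verify the result
- Check that the output file is complete
- Verify the accuracy of the extracted text/data
- Confirm that formatting is preserved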
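A first-pass integrity check on an output file might look like the sketch below. It only confirms that the file exists, is non-empty, starts with the `%PDF-` header, and has an `%%EOF` marker near the end. The helper name `looks_like_pdf` and the 1024-byte tail window are assumptions made for this example. Passing this check does not prove the file is fully valid, so opening the result with a PDF library is still advisable.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Cheap structural check: header at the start, %%EOF marker near the end."""
    file = Path(path)
    if not file.is_file():
        return False
    size = file.stat().st_size
    if size == 0:
        return False
    with file.open("rb") as fh:
        head = fh.read(5)
        # Assumption: the end-of-file marker sits within the last 1024 bytes
        fh.seek(max(0, size - 1024))
        tail = fh.read()
    return head == b"%PDF-" and b"%%EOF" in tail
```

For example, `looks_like_pdf("merged.pdf")` can be called right after a merge to catch a truncated or empty output before it is passed on.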
Guardrails
Anti-Patterns
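- ❌ Using Unicode subscript/superscript characters (they render as black boxes)
- ❌ Not validating extracted table data
- ❌ Ignoring PDF version compatibility issues
- ❌ Processing large PDFs without splitting the work into page batches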
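For the last point, one way to avoid handling a large PDF in a single pass is to walk its page indices in fixed-size batches. The sketch below is pure Python; `page_batches` and its parameters are hypothetical names for this example, and the right batch size depends on the document and available memory.

```python
from typing import Iterator


def page_batches(total_pages: int, batch_size: int) -> Iterator[range]:
    """Yield ranges of zero-based page indices, at most batch_size pages each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield range(start, min(start + batch_size, total_pages))


# Each index in a batch can be used to fetch one page,
# for example reader.pages[i] with pypdf.
for batch in page_batches(total_pages=5, batch_size=2):
    print(list(batch))
```

Results from each batch can be written out or appended to a file before the next batch starts, so the full text of the document never has to sit in memory at once.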
Limitations
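- Scanned PDFs need OCR before text can be extracted
- Complex layouts may lose formatting
- Some PDFs have font-embedding issues
- Encrypted PDFs require the password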
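Because scanned pages usually yield little or no text from normal extraction, a simple heuristic can flag the pages that probably need OCR. The sketch below works on a list of already-extracted page strings. The helper name `pages_needing_ocr` and the `min_chars` threshold of 20 are assumptions for illustration and may need tuning.

```python
from typing import List, Optional, Sequence


def pages_needing_ocr(
    page_texts: Sequence[Optional[str]], min_chars: int = 20
) -> List[int]:
    """Return 1-based page numbers whose extracted text is missing or very short."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Treat None (no text layer found) the same as an empty string
        if len((text or "").strip()) < min_chars:
            flagged.append(number)
    return flagged


texts = ["This page has a real embedded text layer.", "", None, "  \n "]
print(pages_needing_ocr(texts))
```

Only the flagged pages then need to be rendered to images and passed to the OCR step, which is the slowest part of the workflow.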
Important Notes
- Subscripts/Superscripts: Use the <sub> and <super> tags, not Unicode characters
- Table Extraction: Complex tables may need manual adjustment
- OCR Quality: Depends on image clarity
Quick Start
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
Python Libraries
pypdf - Basic Operations
Merge PDFs
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
Split PDF
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    # Write each page to its own single-page file
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
Extract Metadata
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata  # may be None if the PDF has no document info
if meta is not None:
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
Rotate Pages
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
pdfplumber - Text and Table Extraction
Extract Text with Layout
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
Extract Tables
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
Advanced Table Extraction
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                # Treat the first row of each table as its header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables and export (to_excel needs an Excel engine such as openpyxl)
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
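Since extracted tables should be validated before use, a small check on the raw rows can catch the most common extraction problems. The sketch below assumes the shape returned by `page.extract_tables()`: each table is a list of rows, and each row is a list of cell values that are strings or `None`. The helper name `table_problems` and the specific checks are illustrative choices, not part of pdfplumber.

```python
from typing import List, Optional, Sequence


def table_problems(table: Sequence[Sequence[Optional[str]]]) -> List[str]:
    """Report rows whose width differs from the header or whose cells are all empty."""
    if not table:
        return ["table is empty"]
    problems = []
    width = len(table[0])  # the first row is treated as the header
    for row_number, row in enumerate(table[1:], start=2):
        if len(row) != width:
            problems.append(
                f"row {row_number}: expected {width} cells, got {len(row)}"
            )
        empty = sum(1 for cell in row if cell is None or not cell.strip())
        if empty == len(row):
            problems.append(f"row {row_number}: all cells empty")
    return problems


sample = [["Name", "Qty"], ["Apple", "3"], ["Pear"], [None, ""]]
for problem in table_problems(sample):
    print(problem)
```

Running a check like this before building a DataFrame helps, because rows with the wrong number of cells would otherwise cause errors or misaligned columns.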
reportlab - Create PDFs
Basic PDF Creation
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

c.drawString(100, height - 100, "Hello World!")
c.line(100, height - 140, 400, height - 140)
c.save()
Multi-Page PDF
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Page 1
story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Body content", styles['Normal']))
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))

doc.build(story)
Subscripts and Superscripts
IMPORTANT: Never use Unicode subscript/superscript characters. Use ReportLab's XML markup:
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph

styles = getSampleStyleSheet()

# Subscripts: use <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])
# Superscripts: use <super> tag
squared = Paragraph("x<super>2</super>", styles['Normal'])
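When input text already contains Unicode subscript or superscript digits, it can be converted to the markup above before it reaches a Paragraph. The sketch below is plain string processing and does not call ReportLab. The helper name `to_reportlab_markup` is hypothetical, and only the digits 0-9 are handled, so other Unicode sub/superscript symbols would need to be added to the mappings.

```python
import re

# Unicode subscript/superscript digits mapped to their plain equivalents
SUBSCRIPTS = dict(zip("₀₁₂₃₄₅₆₇₈₉", "0123456789"))
SUPERSCRIPTS = dict(zip("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789"))


def _replace_runs(text: str, mapping: dict, tag: str) -> str:
    """Wrap each run of mapped characters in a single <tag>...</tag> pair."""
    pattern = re.compile("[" + "".join(mapping) + "]+")

    def repl(match: "re.Match") -> str:
        digits = "".join(mapping[ch] for ch in match.group())
        return f"<{tag}>{digits}</{tag}>"

    return pattern.sub(repl, text)


def to_reportlab_markup(text: str) -> str:
    """Convert Unicode sub/superscript digits to <sub>/<super> markup."""
    text = _replace_runs(text, SUBSCRIPTS, "sub")
    return _replace_runs(text, SUPERSCRIPTS, "super")


print(to_reportlab_markup("H₂O and x¹⁰"))
```

Consecutive characters are grouped into one tag pair, so a two-digit exponent becomes a single `<super>` element instead of two adjacent ones.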
Command-Line Tools
pdftotext (poppler-utils)
# Extract text
pdftotext input.pdf output.txt
# Preserve layout
pdftotext -layout input.pdf output.txt
# Specific pages (pages 1-5)
pdftotext -f 1 -l 5 input.pdf output.txt
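When pdftotext is driven from Python, building the argument list in one place keeps the options consistent. The sketch below uses only the flags shown above (`-layout`, `-f`, `-l`). The helper name `pdftotext_command` is hypothetical, and the command is only assembled here, not executed.

```python
from typing import List, Optional


def pdftotext_command(
    src: str,
    dst: str,
    layout: bool = False,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
) -> List[str]:
    """Build a pdftotext argument list; options come before the file names."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [src, dst]
    return cmd


print(pdftotext_command("input.pdf", "output.txt", first_page=1, last_page=5))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where poppler-utils is installed.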
qpdf
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1
# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
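The qpdf merge command can also be called from a script, with a check that the binary is available first. The sketch below mirrors the merge invocation above. The helper names `qpdf_merge_command` and `merge_with_qpdf` are hypothetical, and error handling is kept to a minimum.

```python
import shutil
import subprocess
from typing import List, Sequence


def qpdf_merge_command(inputs: Sequence[str], output: str) -> List[str]:
    """Build the argument list for: qpdf --empty --pages <inputs...> -- <output>."""
    if not inputs:
        raise ValueError("at least one input file is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def merge_with_qpdf(inputs: Sequence[str], output: str) -> None:
    """Run the merge, failing early if qpdf is not on PATH."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf is not installed or not on PATH")
    # check=True raises CalledProcessError if qpdf exits with an error status
    subprocess.run(qpdf_merge_command(inputs, output), check=True)


print(qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.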
Common Tasks
OCR on Scanned PDFs
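# Requires the Tesseract and poppler binaries to be installed on the system
import pytesseract
from pdf2image import convert_from_path

# Convert each PDF page to an image
images = convert_from_path('scanned.pdf')

# Run OCR page by page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"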
Add Watermark
from pypdf import PdfReader, PdfWriter

watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
Extract Images
pdfimages -j input.pdf output_prefix
Password Protection
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
Quick Reference
| Task | Best Tool | Command/Code |
|---|---|---|
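| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf |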
| OCR scanned PDFs | pytesseract | Convert to image first |
Related Skills
| Skill | Relationship | Use Case |
|---|---|---|
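| document-hub | Higher-level wrapper | Converting between Word/Excel and PDF |
| image-ocr | Helper tool | OCR for scanned PDFs |
| content-extractor | Content source | Extracting content to generate PDFs |
| email-sender | Downstream distribution | Sending PDFs as attachments |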
About UniqueClub
Part of the UniqueClub toolkit - a collection of skills for AI-powered content creation and automation.