Open-skills pdf-manipulation
Manipulate PDF files including merge, split, extract, redact, convert, and secure workflows.
install
source · Clone the upstream repo
git clone https://github.com/besoeasy/open-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/besoeasy/open-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pdf-manipulation" ~/.claude/skills/besoeasy-open-skills-pdf-manipulation && rm -rf "$T"
manifest: skills/pdf-manipulation/SKILL.md
PDF Manipulation Skill
Merge, split, extract, redact, and transform PDF files using free command-line tools and libraries. Covers common PDF operations for document automation workflows.
When to use
- Merge multiple PDFs into one document
- Split large PDFs into separate files or page ranges
- Extract text, images, or specific pages
- Redact sensitive information
- Add watermarks, passwords, or metadata
- Convert PDFs to images or other formats
Required tools
- pdftk — Swiss Army knife for PDF manipulation (merge, split, rotate, encrypt)
- qpdf — PDF transformation and encryption (linearize, decrypt, repair)
- pdftotext / pdfimages — Part of poppler-utils (extract text and images)
- ghostscript (gs) — Advanced PDF processing, compression, and conversion
Installation
# Ubuntu/Debian
sudo apt-get install pdftk qpdf poppler-utils ghostscript

# macOS (Homebrew)
brew install pdftk-java qpdf poppler ghostscript

# For Node.js: pdf-lib is pure JS with no system deps; pdf-parse 1.x is used for text extraction
npm i pdf-lib pdf-parse@1

# For Python: pypdf is the maintained successor to PyPDF2
pip install pypdf
Skills
Merge PDFs
# Using pdftk (preserves bookmarks, forms)
pdftk file1.pdf file2.pdf file3.pdf cat output merged.pdf

# Using ghostscript (better compression)
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=merged.pdf file1.pdf file2.pdf file3.pdf

# Using qpdf (preserves structure)
qpdf --empty --pages file1.pdf file2.pdf file3.pdf -- merged.pdf
Node.js (pdf-lib):
const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

async function mergePDFs(files, output) {
  const mergedPdf = await PDFDocument.create();
  for (const file of files) {
    const pdfBytes = fs.readFileSync(file);
    const pdf = await PDFDocument.load(pdfBytes);
    // Copy every page of the source document into the merged document
    const pages = await mergedPdf.copyPages(pdf, pdf.getPageIndices());
    pages.forEach(page => mergedPdf.addPage(page));
  }
  const mergedBytes = await mergedPdf.save();
  fs.writeFileSync(output, mergedBytes);
}

// mergePDFs(['file1.pdf', 'file2.pdf'], 'merged.pdf');
Split PDF (by page or range)
# Split every page into separate files
pdftk input.pdf burst output page_%02d.pdf

# Extract specific pages (e.g., pages 1-5 and 10)
pdftk input.pdf cat 1-5 10 output subset.pdf

# Extract page ranges with qpdf
qpdf input.pdf --pages . 1-5 -- output.pdf

# Split every N pages (e.g., every 2 pages) with qpdf; %d expands to the page range
qpdf --split-pages=2 input.pdf chunk_%d.pdf
Node.js (pdf-lib):
const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

// pages holds 1-based page numbers, e.g. [1, 3, 5]
async function extractPages(inputPath, pages, outputPath) {
  const pdfBytes = fs.readFileSync(inputPath);
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const newPdf = await PDFDocument.create();
  for (const pageNum of pages) {
    // copyPages() takes zero-based indices, hence pageNum - 1
    const [page] = await newPdf.copyPages(pdfDoc, [pageNum - 1]);
    newPdf.addPage(page);
  }
  const newBytes = await newPdf.save();
  fs.writeFileSync(outputPath, newBytes);
}

// extractPages('input.pdf', [1, 3, 5], 'output.pdf');
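The CLI tools take range strings such as 1-5 10, while the pdf-lib examples take page numbers or indices. A minimal sketch of a converter between the two follows. `parsePageSpec` is a hypothetical helper name (not part of pdf-lib), and the comma-separated syntax with an optional `end` keyword is an assumption made for illustration.

```javascript
// Hypothetical helper: turn a 1-based page spec such as "1-5,10" into the
// zero-based indices that pdf-lib's copyPages() expects.
function parsePageSpec(spec, pageCount) {
  const indices = [];
  for (const part of spec.split(',')) {
    // Accept "N", "N-M", or "N-end", with optional surrounding spaces
    const match = /^\s*(\d+)\s*(?:-\s*(\d+|end)\s*)?$/.exec(part);
    if (!match) throw new Error(`Invalid page spec: "${part}"`);
    const first = Number(match[1]);
    let last = first;
    if (match[2] === 'end') last = pageCount;
    else if (match[2] !== undefined) last = Number(match[2]);
    if (first < 1 || last > pageCount || first > last) {
      throw new RangeError(`Pages out of range: "${part.trim()}"`);
    }
    for (let page = first; page <= last; page++) indices.push(page - 1);
  }
  return indices;
}
```

With pdf-lib, the result could be passed straight through, for example `newPdf.copyPages(pdfDoc, parsePageSpec('1-5,10', pdfDoc.getPageCount()))`. Descending ranges are rejected in this sketch rather than reversed.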
Extract text
# Extract all text (reading order; add -layout to preserve the physical layout)
pdftotext input.pdf output.txt

# Extract text in content-stream order (raw)
pdftotext -raw input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt

# Preserve layout and write to stdout
pdftotext -layout input.pdf -
Node.js (pdf-parse):
// Written against the pdf-parse 1.x API, where the module exports a single function
const fs = require('fs');
const pdf = require('pdf-parse');

async function extractText(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdf(dataBuffer);
  return data.text;
}

// extractText('input.pdf').then(console.log);
Extract images
# Extract all images from PDF
pdfimages -all input.pdf output_prefix
# Output: output_prefix-000.png, output_prefix-001.jpg, etc.

# Extract only JPEGs
pdfimages -j input.pdf output_prefix
Redact / Remove pages
# Note: these commands drop whole pages; they do not black out content within a page

# Remove specific pages (e.g., remove pages 2-4)
pdftk input.pdf cat 1 5-end output redacted.pdf

# Keep only specific pages
pdftk input.pdf cat 1-10 20-30 output selected.pdf
Add password protection
# Encrypt PDF with password
pdftk input.pdf output secured.pdf user_pw mypassword

# Remove password
pdftk secured.pdf input_pw mypassword output unlocked.pdf

# Using qpdf (AES-256)
qpdf --encrypt userpass ownerpass 256 -- input.pdf output.pdf
Node.js (qpdf via child_process):
const { execFileSync } = require('child_process');

// pdf-lib has no encryption support: PDFDocument.save() ignores
// userPassword/ownerPassword, so its output would be unencrypted.
// This version calls qpdf (AES-256) instead; qpdf must be on PATH.
function encryptPDF(inputPath, password, outputPath) {
  // Same password for user and owner, as in the original example
  execFileSync('qpdf', ['--encrypt', password, password, '256', '--', inputPath, outputPath]);
}

// encryptPDF('input.pdf', 'mypassword', 'secured.pdf');
Rotate pages
# Rotate all pages 90 degrees clockwise
pdftk input.pdf cat 1-endright output rotated.pdf

# Rotate specific pages
pdftk input.pdf cat 1-5 6right 7-end output rotated.pdf

# Options: right (90°), left (270°), down (180°)
Compress / Reduce file size
# Using ghostscript (adjust quality)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
  -dNOPAUSE -dQUIET -dBATCH -sOutputFile=compressed.pdf input.pdf

# Quality settings:
# /screen   - low quality (72 dpi)
# /ebook    - medium (150 dpi)
# /printer  - high (300 dpi)
# /prepress - highest (300 dpi, preserves color)

# Using qpdf (lossless compression)
qpdf --linearize --object-streams=generate input.pdf compressed.pdf
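When the ghostscript command is driven from a script, it can help to build the argument list in one place and reject unknown presets early. The sketch below does that; `buildGsCompressArgs` and `GS_PRESETS` are hypothetical names introduced here, and the flags are the ones from the command above.

```javascript
// Ghostscript -dPDFSETTINGS presets from the quality list above.
const GS_PRESETS = ['screen', 'ebook', 'printer', 'prepress'];

// Hypothetical helper: build the argument array for a `gs` compression run.
function buildGsCompressArgs(inputPath, outputPath, preset = 'ebook') {
  if (!GS_PRESETS.includes(preset)) {
    throw new Error(`Unknown preset "${preset}", expected one of: ${GS_PRESETS.join(', ')}`);
  }
  return [
    '-sDEVICE=pdfwrite',
    '-dCompatibilityLevel=1.4',
    `-dPDFSETTINGS=/${preset}`,
    '-dNOPAUSE',
    '-dQUIET',
    '-dBATCH',
    `-sOutputFile=${outputPath}`,
    inputPath,
  ];
}
```

The array could then be handed to `child_process.execFileSync('gs', buildGsCompressArgs('input.pdf', 'compressed.pdf'))`. Passing arguments as an array avoids shell quoting problems with file names that contain spaces.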
Convert PDF to images
# Convert each page to PNG (300 DPI)
pdftoppm -png -r 300 input.pdf output_prefix
# Output: output_prefix-1.png, output_prefix-2.png, etc.

# Convert to JPEG
pdftoppm -jpeg -r 150 input.pdf output_prefix

# Using ImageMagick (alternative; ImageMagick 7 names the command "magick")
convert -density 300 input.pdf output_%03d.png
Add watermark
# Overlay the first page of watermark.pdf on every page
pdftk input.pdf stamp watermark.pdf output watermarked.pdf

# Background watermark (behind content)
pdftk input.pdf background watermark.pdf output watermarked.pdf

# Per-page watermark: page N of watermark.pdf is stamped onto page N of input.pdf
pdftk input.pdf multistamp watermark.pdf output watermarked.pdf
Get PDF metadata
# Using pdftk
pdftk input.pdf dump_data

# Using qpdf (prints the trailer, which references the /Info metadata dictionary)
qpdf --show-object=trailer input.pdf

# Using pdfinfo (poppler-utils)
pdfinfo input.pdf
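pdfinfo prints one "Key: value" pair per line, which makes it straightforward to turn into structured data. A minimal parsing sketch follows; `parsePdfinfo` is a hypothetical helper name, and only the `Pages` field is converted to a number here.

```javascript
// Hypothetical helper: parse the "Key:   value" lines printed by pdfinfo
// into a plain object.
function parsePdfinfo(stdout) {
  const info = {};
  for (const line of stdout.split(/\r?\n/)) {
    // Split on the first colon only, since values such as dates contain colons
    const colon = line.indexOf(':');
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim();
    const value = line.slice(colon + 1).trim();
    if (key) info[key] = key === 'Pages' ? Number(value) : value;
  }
  return info;
}
```

One way to feed it, assuming pdfinfo is on PATH: `parsePdfinfo(require('child_process').execFileSync('pdfinfo', ['input.pdf'], { encoding: 'utf8' }))`.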
Multi-operation script (Node.js)
const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

class PDFHelper {
  // Merge the given files, in order, into a single PDF.
  static async merge(files, output) {
    const merged = await PDFDocument.create();
    for (const file of files) {
      const pdf = await PDFDocument.load(fs.readFileSync(file));
      const pages = await merged.copyPages(pdf, pdf.getPageIndices());
      pages.forEach(p => merged.addPage(p));
    }
    fs.writeFileSync(output, await merged.save());
  }

  // Copy selected pages into a new PDF.
  // pageIndices are zero-based (page 1 is index 0), as copyPages() requires.
  static async split(input, pageIndices, output) {
    const pdf = await PDFDocument.load(fs.readFileSync(input));
    const newPdf = await PDFDocument.create();
    const pages = await newPdf.copyPages(pdf, pageIndices);
    pages.forEach(p => newPdf.addPage(p));
    fs.writeFileSync(output, await newPdf.save());
  }

  // Return the page count and basic document metadata.
  static async info(input) {
    const pdf = await PDFDocument.load(fs.readFileSync(input));
    return {
      pages: pdf.getPageCount(),
      title: pdf.getTitle(),
      author: pdf.getAuthor(),
      creator: pdf.getCreator()
    };
  }
}

module.exports = PDFHelper;
Agent prompt
You have PDF manipulation skills. When a user requests PDF operations:
1. Detect the operation: merge, split, extract (text/images/pages), redact, compress, encrypt, rotate, watermark, or get info.
2. Use appropriate tools:
   - pdftk for merge, split, rotate, encrypt, watermark
   - pdftotext/pdfimages for extraction
   - ghostscript for compression
   - qpdf for repair and advanced operations
3. Always validate input files exist before processing.
4. For scripting, prefer pdf-lib (Node.js) or pypdf (Python, the successor to PyPDF2) for portability.
5. Return structured output (file paths, metadata, text) in JSON format.
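Steps 1 to 3 and 5 of the prompt can be sketched as a small planner that maps an operation to a tool, checks that the inputs exist, and returns a JSON-serializable result. The operation keys, the result shape, and the names `TOOL_FOR_OPERATION` and `planOperation` are assumptions made for illustration, not part of the skill.

```javascript
const fs = require('fs');

// Tool choice per operation, following step 2 of the prompt.
// The operation keys themselves are illustrative.
const TOOL_FOR_OPERATION = new Map([
  ['merge', 'pdftk'],
  ['split', 'pdftk'],
  ['rotate', 'pdftk'],
  ['encrypt', 'pdftk'],
  ['watermark', 'pdftk'],
  ['extract-text', 'pdftotext'],
  ['extract-images', 'pdfimages'],
  ['compress', 'gs'],
  ['repair', 'qpdf'],
]);

// Hypothetical planner: validate a request and describe what would be run.
function planOperation(operation, inputFiles) {
  const tool = TOOL_FOR_OPERATION.get(operation);
  if (!tool) {
    return { ok: false, error: `Unsupported operation: ${operation}` };
  }
  // Step 3: refuse to continue if any input file is missing
  const missing = inputFiles.filter((file) => !fs.existsSync(file));
  if (missing.length > 0) {
    return { ok: false, error: 'Input file not found', missing };
  }
  return { ok: true, operation, tool, inputs: inputFiles };
}
```

Serializing the result with `JSON.stringify(planOperation('merge', ['a.pdf', 'b.pdf']))` gives the structured output asked for in step 5. Running the chosen tool is left out of this sketch.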
Best practices
- Validate PDFs before processing (use qpdf --check input.pdf).
- Preserve metadata when possible (use pdftk or pdf-lib, avoid ghostscript for simple operations).
- Use appropriate compression: ghostscript /ebook is a good balance for most cases.
- Security: If the user supplies the password for an encrypted PDF, decrypt the file first and then process it; never log passwords.
- Large files: For 100+ page PDFs, process in chunks or use streaming APIs.
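For the chunking advice, one simple approach is to precompute groups of zero-based page indices and process one group at a time. The sketch below shows that grouping only; `chunkPageIndices` is a hypothetical helper name, and a suitable chunk size depends on the documents involved.

```javascript
// Hypothetical helper: group zero-based page indices into chunks of at most
// chunkSize pages, e.g. to copy a large PDF a few pages at a time.
function chunkPageIndices(pageCount, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const chunks = [];
  for (let start = 0; start < pageCount; start += chunkSize) {
    const end = Math.min(start + chunkSize, pageCount);
    chunks.push(Array.from({ length: end - start }, (_, i) => start + i));
  }
  return chunks;
}
```

Each chunk could be passed to pdf-lib's `copyPages()` to write one output file per group, which also covers the "split every N pages" case.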
Common workflows
Invoice processing
# 1. Extract text for parsing
pdftotext invoice.pdf invoice.txt

# 2. Extract first page only (summary)
pdftk invoice.pdf cat 1 output summary.pdf

# 3. Compress for archival
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -dBATCH -dNOPAUSE -q \
  -sOutputFile=invoice_compressed.pdf invoice.pdf
Batch processing
# Merge all PDFs in a directory
pdftk *.pdf cat output combined.pdf

# Split each PDF in directory into individual pages
for f in *.pdf; do
  pdftk "$f" burst output "${f%.pdf}_page_%02d.pdf"
done

# Extract text from all PDFs
for f in *.pdf; do
  pdftotext "$f" "${f%.pdf}.txt"
done
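One pitfall with the `*.pdf` glob is that a second run also matches combined.pdf, so the previous output gets merged into the new one. A small sketch of selecting inputs from a directory listing while skipping the output file follows; `selectMergeInputs` and `textNameFor` are hypothetical helper names.

```javascript
// Hypothetical helper: pick the PDFs to merge from a directory listing,
// skipping the merge output so a re-run does not merge its own result.
function selectMergeInputs(fileNames, outputName) {
  return fileNames
    .filter((name) => name.endsWith('.pdf') && name !== outputName)
    .sort();
}

// Mirror the shell expansion "${f%.pdf}.txt": swap the .pdf suffix for .txt.
function textNameFor(pdfName) {
  return pdfName.endsWith('.pdf') ? `${pdfName.slice(0, -4)}.txt` : `${pdfName}.txt`;
}
```

A typical call would be `selectMergeInputs(require('fs').readdirSync('.'), 'combined.pdf')`. Like the shell glob, the match on `.pdf` is case-sensitive.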
Troubleshooting
- Corrupted PDF: Use qpdf --check, then qpdf input.pdf --replace-input to repair.
- Encrypted PDF: Remove password first with qpdf --decrypt --password=PASS input.pdf output.pdf.
- Large file size: Use ghostscript compression or remove embedded fonts/images if not needed.
- Missing fonts: Install fonts-liberation or msttcorefonts packages.
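The check-then-repair step can be scripted by looking at qpdf's exit status, which is 0 when no problems were found, 3 when there were only warnings, and 2 when there were errors. The sketch below wraps that; `describeQpdfStatus` and `checkPdf` are hypothetical helper names, and the result shape is an assumption.

```javascript
const { spawnSync } = require('child_process');

// Map qpdf's exit status to a label:
// 0 = no problems, 3 = warnings only, 2 = errors.
function describeQpdfStatus(exitCode) {
  switch (exitCode) {
    case 0: return 'ok';
    case 3: return 'warnings';
    case 2: return 'errors';
    default: return 'unknown';
  }
}

// Run `qpdf --check` on a file. Requires qpdf on PATH; if it cannot be
// started, the status is reported as 'unavailable' instead of throwing.
function checkPdf(filePath) {
  const result = spawnSync('qpdf', ['--check', filePath], { encoding: 'utf8' });
  if (result.error) {
    return { status: 'unavailable', detail: result.error.message };
  }
  return { status: describeQpdfStatus(result.status), detail: result.stderr.trim() };
}
```

A caller might attempt a repair only when `checkPdf('input.pdf').status` is `'warnings'` or `'errors'`, and skip files that already report `'ok'`.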
See also
- anonymous-file-upload.md — Upload processed PDFs anonymously.
- using-web-scraping.md — Scrape web pages and convert to PDF.