Skilllibrary — pdf-extraction
Extract text, tables, and metadata from text-native PDFs using pdfplumber, PyMuPDF (fitz), tabula-py, or camelot with layout-aware parsing and coordinate-based region selection. Use when pulling structured content from PDFs, extracting specific tables, or reading PDF metadata. Do not use for scanned/image PDFs (prefer image-heavy-pdfs) or modifying PDF files (prefer pdf-editor).
git clone https://github.com/merceralex397-collab/skilllibrary
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/15-docs-artifacts-media/pdf-extraction" ~/.claude/skills/merceralex397-collab-skilllibrary-pdf-extraction && rm -rf "$T"
15-docs-artifacts-media/pdf-extraction/SKILL.md

Purpose
Extract text, tables, and metadata from text-native PDFs using pdfplumber, PyMuPDF (fitz), tabula-py, or camelot with layout-aware parsing, coordinate-based region selection, and structural analysis. This skill handles PDFs that have embedded text layers — reports, papers, financial statements, technical documents — and produces clean text, structured table data, and document metadata.
When to use this skill
- Extracting body text from a PDF for further processing, indexing, or analysis
- Pulling specific tables from financial reports, invoices, or data sheets
- Reading PDF metadata: title, author, creation date, page count, producer
- Extracting text from specific regions of a page (headers, footers, sidebars) using coordinate-based selection
- Building a PDF processing pipeline that feeds text into NLP, search, or structured data extraction
- Comparing text content across multiple PDF versions
Do not use this skill when
- The PDF is scanned or image-based (text extraction returns empty/garbled) — prefer image-heavy-pdfs for OCR
- The task is modifying the PDF (merge, split, rotate, watermark) — prefer pdf-editor
- The task is creating a new PDF from data — prefer pdf-generation
- The extracted data needs entity extraction and schema mapping — use this skill first, then pipe to document-to-structured-data
Operating procedure
- Install dependencies — ensure the extraction library is available. For pdfplumber: pip install pdfplumber. For PyMuPDF: pip install PyMuPDF. For tabula-py: pip install tabula-py (requires a Java runtime). For camelot: pip install camelot-py[cv].
- Open the PDF — load with pdfplumber.open('file.pdf') or fitz.open('file.pdf'). Verify the file opens without errors and is not encrypted. Check the page count and log it.
- Detect content type — check whether pages contain extractable text by calling page.extract_text() on the first 3 pages. If text extraction returns fewer than 10 characters per page, the PDF is likely image-based — redirect to image-heavy-pdfs.
- Extract full text — iterate over all pages and extract text. Use page.extract_text() (pdfplumber) or page.get_text("text") (PyMuPDF). Preserve page boundaries in the output with page number markers.
- Extract layout-aware text — for documents with columns, headers, or complex layouts, use page.get_text("blocks") (PyMuPDF) to get text blocks with bounding-box coordinates. Sort blocks by reading order: top-to-bottom, left-to-right. Detect multi-column layouts by identifying vertical gaps between text blocks.
- Extract tables — use pdfplumber's page.extract_tables() for simple grid tables. For complex tables without visible borders, use camelot with flavor='stream'. For tables spanning multiple pages, extract from each page and concatenate matching column structures.
- Extract specific regions — define a bounding box (x0, y0, x1, y1) for the target region. Use page.within_bbox(bbox).extract_text() (pdfplumber) or page.get_text("text", clip=rect) (PyMuPDF). Use this for extracting headers, footers, margin notes, or specific form fields.
- Extract metadata — read the document info dictionary: title, author, subject, keywords, creator, producer, creation date, modification date. Use the .metadata attribute (pdfplumber's PDF object and PyMuPDF's Document both expose it). Also extract: page count, page dimensions, PDF version.
- Extract hyperlinks and annotations — identify clickable links, bookmarks, and annotations on each page. Use page.annots() (PyMuPDF) to get annotation coordinates, types, and linked URIs.
- Clean extracted text — remove soft hyphens at line breaks (re-join hyphenated words), normalize whitespace (collapse multiple spaces/newlines), fix common encoding artifacts (ligatures: fi, fl, ff), and strip headers/footers that repeat on every page.
- Structure the output — organize extracted content by page. For each page, output: page number, full text, tables (as list of row dictionaries), and annotations. Produce the final output as JSON, Markdown, or plain text per the requester's needs.
- Validate extraction quality — spot-check 3-5 pages by comparing extracted text against the visual PDF content. Verify table column counts match, reading order is correct, and no text blocks were missed.
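The text-cleaning step in the procedure above can be sketched as a small pure-Python pass. The ligature map, regexes, and ordering are illustrative choices, not a mandated implementation:

```python
import re

# Common Unicode ligatures that appear in extracted PDF text.
LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
             "\ufb03": "ffi", "\ufb04": "ffl"}

def clean_text(raw: str) -> str:
    text = raw
    # Expand ligature code points into their ASCII letter sequences.
    for lig, ascii_form in LIGATURES.items():
        text = text.replace(lig, ascii_form)
    # Soft hyphens (U+00AD) never carry meaning in extracted text.
    text = text.replace("\u00ad", "")
    # Re-join words hyphenated at line breaks: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of horizontal whitespace and excess blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Stripping repeated headers/footers is a separate pass (it needs cross-page context), so it is not shown here.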
Decision rules
- Use pdfplumber as the default for general text and table extraction — it handles most documents well.
- Use PyMuPDF (fitz) when speed is critical (10× faster than pdfplumber for text-only extraction) or when coordinate-level precision is needed.
- Use tabula-py or camelot when pdfplumber fails on complex tables — camelot's stream mode handles borderless tables better.
- If text extraction returns garbled Unicode, try different extraction parameters or switch libraries before assuming the PDF is image-based.
- For multi-column academic papers, always use layout-aware extraction (block-based) rather than simple text extraction.
- If a table has merged cells, extract the raw grid and post-process to fill merged values downward/rightward.
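The multi-column decision rule above reduces to sorting blocks by (column, vertical position). A minimal sketch, working on (x0, y0, x1, y1, text) tuples such as those returned by block-based extraction; the col_gap threshold and the x-midpoint column assignment are illustrative assumptions to tune per document:

```python
def reading_order(blocks, col_gap=50):
    """Sort (x0, y0, x1, y1, text) blocks into reading order.

    Columns are detected by horizontal gaps of at least `col_gap` points
    between the x-extents of blocks; within a column, blocks read top to
    bottom (y increasing downward, as in PDF extraction coordinates).
    """
    if not blocks:
        return []
    # Merge overlapping x-intervals; a gap >= col_gap starts a new column.
    xs = sorted((b[0], b[2]) for b in blocks)
    columns = [[xs[0][0], xs[0][1]]]
    for x0, x1 in xs[1:]:
        if x0 - columns[-1][1] >= col_gap:
            columns.append([x0, x1])
        else:
            columns[-1][1] = max(columns[-1][1], x1)

    def col_index(block):
        mid = (block[0] + block[2]) / 2
        for i, (c0, c1) in enumerate(columns):
            if c0 <= mid <= c1:
                return i
        return len(columns)

    return sorted(blocks, key=lambda b: (col_index(b), b[1], b[0]))
```

With a two-column page this yields the left column top-to-bottom, then the right column, instead of the interleaved order simple extraction produces.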
Output requirements
- Extracted text — clean text organized by page number, with reading order preserved
- Table data — each table as a list of row dictionaries with normalized column headers, exportable to CSV or JSON
- Document metadata — title, author, creation date, page count, PDF version, producer
- Extraction report — pages processed, tables found, extraction method used per page, any pages that returned empty text
- Quality flags — pages with potential extraction issues (low text density, garbled characters, missing expected tables)
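The per-page record described above (page number, text, tables as row dictionaries, annotations) can be assembled like this; the header normalization rule (lowercase, underscores) is an illustrative choice:

```python
import json

def page_record(page_number, text, tables, annotations):
    """Build one per-page output record.

    `tables` is a list of raw grids (list of rows, first row = header);
    each is converted to row dictionaries with normalized column names.
    """
    def normalize(header):
        return header.strip().lower().replace(" ", "_")

    structured_tables = []
    for table in tables:
        headers = [normalize(h) for h in table[0]]
        structured_tables.append([dict(zip(headers, row)) for row in table[1:]])

    return {
        "page": page_number,
        "text": text,
        "tables": structured_tables,
        "annotations": annotations,
    }

# The record is directly serializable for the JSON output option.
record = page_record(1, "Revenue grew 4%.",
                     [[["Item", "Amount"], ["Widgets", "42"]]], [])
as_json = json.dumps(record)
```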
References
- pdfplumber documentation — https://github.com/jsvine/pdfplumber
- PyMuPDF (fitz) documentation — https://pymupdf.readthedocs.io/en/latest/
- tabula-py documentation — https://tabula-py.readthedocs.io/en/latest/
- camelot documentation — https://camelot-py.readthedocs.io/en/master/
Related skills
- image-heavy-pdfs — for scanned/image-based PDFs requiring OCR before text extraction
- pdf-editor — for modifying PDF structure (merge, split, rotate, watermark)
- table-extraction — for focused table extraction across multiple document formats
- document-to-structured-data — for mapping extracted text to target schemas
Anti-patterns
- Using simple text extraction on multi-column PDFs — produces interleaved column text that reads as nonsense. Always use layout-aware extraction for complex layouts.
- Assuming all PDFs have text layers — failing to detect image-only PDFs wastes time on empty extraction. Always check text density first.
- Extracting tables as plain text — loses column structure entirely. Always use table-specific extraction methods.
- Ignoring page headers and footers — repeated headers/footers contaminate extracted text. Detect and strip them by identifying text that repeats at the same coordinates across pages.
- Hardcoding bounding boxes — coordinates differ between PDF versions and page sizes. Use content-anchored regions (find a label, then extract relative to it) when possible.
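The header/footer anti-pattern above suggests detecting text that repeats at the same coordinates across pages. A sketch of that detection, over per-page line lists of ((x0, y0, x1, y1), text); the repeat fraction and coordinate-snapping grid are illustrative defaults:

```python
from collections import Counter

def find_repeating_lines(pages, min_fraction=0.8, grid=5):
    """Return (snapped_bbox, text) pairs that recur at roughly the same
    position on at least `min_fraction` of pages: header/footer candidates.

    Coordinates are snapped to a `grid`-point grid so small positional
    jitter between pages still counts as "the same coordinates".
    """
    def key(bbox, text):
        return (tuple(round(c / grid) for c in bbox), text.strip())

    counts = Counter(key(bbox, text) for lines in pages for bbox, text in lines)
    threshold = min_fraction * len(pages)
    return {k for k, n in counts.items() if n >= threshold}
```

Lines in the returned set can then be dropped from each page before the cleaning pass.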
Failure handling
- If the PDF is encrypted, attempt to open with an empty password (some PDFs have owner-only restrictions). If that fails, report the encryption type and halt.
- If a page returns zero text but is not blank (has visible content), flag it as a potential image page and suggest running image-heavy-pdfs on that page.
- If table extraction produces inconsistent column counts across rows, log the raw extraction and attempt re-extraction with a different library or parameters.
- If the required library is not installed, output the exact install command and halt.
- If the PDF is malformed (broken cross-references, truncated), attempt repair with pikepdf's open(path, allow_overwriting_input=True) followed by a save back to the same path before extraction. Log that repair was needed.
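The inconsistent-column-count case above needs the mismatch logged without silently dropping cells. A minimal sketch of that check; padding to the widest row is one reasonable policy, not the only one:

```python
from collections import Counter

def normalize_table(rows):
    """Flag rows whose width differs from the most common width, and pad
    every row to the widest row so no cell data is lost.

    Returns (fixed_rows, issue_indices); a non-empty issue list is the
    signal to log the raw extraction and retry with another library.
    """
    widths = Counter(len(r) for r in rows)
    common = widths.most_common(1)[0][0]
    issues = [i for i, r in enumerate(rows) if len(r) != common]
    target = max(len(r) for r in rows)
    fixed = [list(r) + [""] * (target - len(r)) for r in rows]
    return fixed, issues
```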