Skilllibrary pdf-extraction

Extract text, tables, and metadata from text-native PDFs using pdfplumber, PyMuPDF (fitz), tabula-py, or camelot with layout-aware parsing and coordinate-based region selection. Use when pulling structured content from PDFs, extracting specific tables, or reading PDF metadata. Do not use for scanned/image PDFs (prefer image-heavy-pdfs) or for modifying PDF files (prefer pdf-editor).

install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/15-docs-artifacts-media/pdf-extraction" ~/.claude/skills/merceralex397-collab-skilllibrary-pdf-extraction && rm -rf "$T"
manifest: 15-docs-artifacts-media/pdf-extraction/SKILL.md
source content

Purpose

Extract text, tables, and metadata from text-native PDFs using pdfplumber, PyMuPDF (fitz), tabula-py, or camelot with layout-aware parsing, coordinate-based region selection, and structural analysis. This skill handles PDFs that have embedded text layers — reports, papers, financial statements, technical documents — and produces clean text, structured table data, and document metadata.

When to use this skill

  • Extracting body text from a PDF for further processing, indexing, or analysis
  • Pulling specific tables from financial reports, invoices, or data sheets
  • Reading PDF metadata: title, author, creation date, page count, producer
  • Extracting text from specific regions of a page (headers, footers, sidebars) using coordinate-based selection
  • Building a PDF processing pipeline that feeds text into NLP, search, or structured data extraction
  • Comparing text content across multiple PDF versions

Do not use this skill when

  • The PDF is scanned or image-based (text extraction returns empty/garbled) — prefer image-heavy-pdfs for OCR
  • The task is modifying the PDF (merge, split, rotate, watermark) — prefer pdf-editor
  • The task is creating a new PDF from data — prefer pdf-generation
  • The extracted data needs entity extraction and schema mapping — use this skill first, then pipe to document-to-structured-data

Operating procedure

  1. Install dependencies — ensure the extraction library is available. For pdfplumber: pip install pdfplumber. For PyMuPDF: pip install PyMuPDF. For tabula-py: pip install tabula-py (requires a Java runtime). For camelot: pip install camelot-py[cv].
  2. Open the PDF — load with pdfplumber.open('file.pdf') or fitz.open('file.pdf'). Verify the file opens without errors and is not encrypted. Check the page count and log it.
  3. Detect content type — check whether pages contain extractable text by calling page.extract_text() on the first 3 pages. If text extraction returns fewer than 10 characters per page, the PDF is likely image-based — redirect to image-heavy-pdfs (steps 2–4 are illustrated in the first sketch after this list).
  4. Extract full text — iterate over all pages and extract text. Use page.extract_text() (pdfplumber) or page.get_text("text") (PyMuPDF). Preserve page boundaries in the output with page number markers.
  5. Extract layout-aware text — for documents with columns, headers, or complex layouts, use page.get_text("blocks") (PyMuPDF) to get text blocks with bounding box coordinates. Sort blocks by reading order: top-to-bottom, left-to-right. Detect multi-column layouts by identifying vertical gaps between text blocks (see the block-sorting sketch after this list).
  6. Extract tables — use pdfplumber's page.extract_tables() for simple grid tables. For complex tables without visible borders, use camelot with flavor='stream'. For tables spanning multiple pages, extract from each page and concatenate matching column structures (see the table sketch after this list).
  7. Extract specific regions — define a bounding box (x0, y0, x1, y1) for the target region. Use page.within_bbox(bbox).extract_text() (pdfplumber) or page.get_text("text", clip=rect) (PyMuPDF). Use this for extracting headers, footers, margin notes, or specific form fields (steps 7–9 share one sketch after this list).
  8. Extract metadata — read the document info dictionary: title, author, subject, keywords, creator, producer, creation date, modification date. Use pdf.metadata (pdfplumber) or doc.metadata (PyMuPDF); the attribute is named metadata in both libraries. Also extract: page count, page dimensions, PDF version.
  9. Extract hyperlinks and annotations — identify clickable links, bookmarks, and annotations on each page. Use page.annots() (PyMuPDF) to get annotation coordinates, types, and contents; use page.get_links() for link rectangles and target URIs.
  10. Clean extracted text — remove soft hyphens at line breaks (re-join hyphenated words), normalize whitespace (collapse multiple spaces/newlines), fix common encoding artifacts (ligature glyphs: ﬁ, ﬂ, ﬀ), and strip headers/footers that repeat on every page (see the cleaning sketch after this list).
  11. Structure the output — organize extracted content by page. For each page, output: page number, full text, tables (as list of row dictionaries), and annotations. Produce the final output as JSON, Markdown, or plain text per the requester's needs.
  12. Validate extraction quality — spot-check 3-5 pages by comparing extracted text against the visual PDF content. Verify table column counts match, reading order is correct, and no text blocks were missed.
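
The sketches below are minimal illustrations of the procedure above, not a definitive implementation; file names, thresholds, and coordinates are assumptions chosen for the examples. First, steps 2–4 with pdfplumber: open, probe for a text layer, and extract full text with page markers.

    import pdfplumber

    PATH = "sample.pdf"  # placeholder path

    with pdfplumber.open(PATH) as pdf:
        print(f"pages: {len(pdf.pages)}")  # step 2: verify it opens, log page count
        # Step 3: probe the first 3 pages; under 10 chars/page suggests an image-based PDF.
        probe = [(p.extract_text() or "") for p in pdf.pages[:3]]
        if all(len(t.strip()) < 10 for t in probe):
            raise SystemExit("likely image-based; use image-heavy-pdfs instead")
        # Step 4: full text, with markers preserving page boundaries.
        full_text = "\n\n".join(
            f"--- page {i} ---\n{page.extract_text() or ''}"
            for i, page in enumerate(pdf.pages, start=1)
        )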
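Step 5 with PyMuPDF: block-based extraction sorted into reading order. The 10-point row banding is an assumed heuristic; genuinely multi-column pages need the vertical-gap analysis described in the step.

    import fitz  # PyMuPDF

    doc = fitz.open("sample.pdf")  # placeholder path
    page = doc[0]
    # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type);
    # block_type 0 is text, 1 is image.
    blocks = [b for b in page.get_text("blocks") if b[6] == 0]
    # Reading order: bin y-coordinates into 10pt bands so blocks on roughly
    # the same line sort left-to-right within a band.
    blocks.sort(key=lambda b: (round(b[1] / 10), b[0]))
    layout_text = "\n".join(b[4].strip() for b in blocks)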
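Step 6: pdfplumber for grid tables, with camelot's stream flavor as the borderless fallback. The header handling assumes the first row holds column names and that at least one table is found.

    import pdfplumber

    with pdfplumber.open("report.pdf") as pdf:  # placeholder path
        for table in pdf.pages[0].extract_tables():
            header, *body = table  # first row as column headers (cells may be None)
            records = [dict(zip(header, row)) for row in body]

    # Fallback for borderless tables (pip install camelot-py[cv]):
    import camelot

    tables = camelot.read_pdf("report.pdf", flavor="stream", pages="1")
    df = tables[0].df  # one pandas DataFrame per detected table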
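Steps 7–9 with PyMuPDF: clip a region, read the info dictionary, and list annotations and link URIs. The 50-point header band is illustrative, not a constant to reuse.

    import fitz

    doc = fitz.open("sample.pdf")  # placeholder path
    page = doc[0]

    # Step 7: extract only the top band of the page (coordinates are illustrative).
    header_rect = fitz.Rect(0, 0, page.rect.width, 50)
    header_text = page.get_text("text", clip=header_rect)

    # Step 8: info dictionary plus structural facts.
    meta = doc.metadata  # title, author, creationDate, producer, ...
    pages, dims = doc.page_count, (page.rect.width, page.rect.height)

    # Step 9: annotations (type name, position, content) and link target URIs.
    notes = [(a.type[1], tuple(a.rect), a.info.get("content", "")) for a in page.annots()]
    uris = [lnk["uri"] for lnk in page.get_links() if "uri" in lnk]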
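Step 10, a sketch of the cleaning pass; the ligature map and regexes cover only the artifacts named in the step and will need extending for other documents.

    import re

    LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl"}  # ﬀ ﬁ ﬂ

    def clean(text: str) -> str:
        for lig, plain in LIGATURES.items():
            text = text.replace(lig, plain)
        text = text.replace("\u00ad", "")             # drop soft hyphens
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # re-join hyphenated line breaks
        text = re.sub(r"[ \t]+", " ", text)           # collapse horizontal whitespace
        text = re.sub(r"\n{3,}", "\n\n", text)        # collapse blank-line runs
        return text.strip()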

Decision rules

  • Use pdfplumber as the default for general text and table extraction — it handles most documents well.
  • Use PyMuPDF (fitz) when speed is critical (10× faster than pdfplumber for text-only extraction) or when coordinate-level precision is needed.
  • Use tabula-py or camelot when pdfplumber fails on complex tables — camelot's stream mode handles borderless tables better.
  • If text extraction returns garbled Unicode, try different extraction parameters or switch libraries before assuming the PDF is image-based.
  • For multi-column academic papers, always use layout-aware extraction (block-based) rather than simple text extraction.
  • If a table has merged cells, extract the raw grid and post-process to fill merged values downward/rightward (see the sketch after this list).
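
For the merged-cell rule, a sketch of the downward fill (the rightward case is symmetric); fill_merged_down is a hypothetical helper, not a library function.

    def fill_merged_down(rows):
        """Replace empty/None cells with the last non-empty value seen in that column."""
        last = {}
        filled = []
        for row in rows:
            new_row = []
            for col, cell in enumerate(row):
                if cell in (None, ""):
                    cell = last.get(col)  # inherit the value from the row above
                else:
                    last[col] = cell
                new_row.append(cell)
            filled.append(new_row)
        return filled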

Output requirements

  1. Extracted text — clean text organized by page number, with reading order preserved
  2. Table data — each table as a list of row dictionaries with normalized column headers, exportable to CSV or JSON
  3. Document metadata — title, author, creation date, page count, PDF version, producer
  4. Extraction report — pages processed, tables found, extraction method used per page, any pages that returned empty text
  5. Quality flags — pages with potential extraction issues (low text density, garbled characters, missing expected tables)

References

Related skills

  • image-heavy-pdfs — for scanned/image-based PDFs requiring OCR before text extraction
  • pdf-editor — for modifying PDF structure (merge, split, rotate, watermark)
  • table-extraction — for focused table extraction across multiple document formats
  • document-to-structured-data — for mapping extracted text to target schemas

Anti-patterns

  • Using simple text extraction on multi-column PDFs — produces interleaved column text that reads as nonsense. Always use layout-aware extraction for complex layouts.
  • Assuming all PDFs have text layers — failing to detect image-only PDFs wastes time on empty extraction. Always check text density first.
  • Extracting tables as plain text — loses column structure entirely. Always use table-specific extraction methods.
  • Ignoring page headers and footers — repeated headers/footers contaminate extracted text. Detect and strip them by identifying text that repeats at the same coordinates across pages (see the sketch after this list).
  • Hardcoding bounding boxes — coordinates differ between PDF versions and page sizes. Use content-anchored regions (find a label, then extract relative to it) when possible.
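
One way to follow the header/footer rule without hardcoding content: count the text found in fixed top and bottom bands on every page and treat strings that repeat on most pages as boilerplate. A pdfplumber sketch; the 50-point bands and 60% threshold are assumptions.

    import pdfplumber
    from collections import Counter

    with pdfplumber.open("sample.pdf") as pdf:  # placeholder path
        bands = Counter()
        for page in pdf.pages:
            w, h = page.width, page.height
            for box in [(0, 0, w, 50), (0, h - 50, w, h)]:  # top and bottom bands
                text = (page.within_bbox(box).extract_text() or "").strip()
                if text:
                    bands[text] += 1
        # Strings appearing on >= 60% of pages are likely repeated headers/footers.
        repeated = {t for t, n in bands.items() if n >= 0.6 * len(pdf.pages)}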

Failure handling

  • If the PDF is encrypted, attempt to open with an empty password (some PDFs have owner-only restrictions). If that fails, report the encryption type and halt.
  • If a page returns zero text but is not blank (has visible content), flag it as a potential image page and suggest running image-heavy-pdfs on that page.
  • If table extraction produces inconsistent column counts across rows, log the raw extraction and attempt re-extraction with a different library or parameters.
  • If the required library is not installed, output the exact install command and halt.
  • If the PDF is malformed (broken cross-references, truncated), attempt repair with pikepdf's open(path, allow_overwriting_input=True) and re-save before extraction. Log that repair was needed. (A sketch of this repair and the empty-password probe follows this list.)
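
A sketch of the empty-password probe and the pikepdf repair pass; the path is a placeholder and the broad except is for illustration only.

    import pdfplumber
    import pikepdf

    PATH = "input.pdf"  # placeholder path

    # Encrypted file: owner-restricted PDFs often open with an empty user password.
    try:
        pdf = pdfplumber.open(PATH, password="")
        pdf.close()
    except Exception as exc:
        raise SystemExit(f"cannot open (encrypted or unreadable), halting: {exc}")

    # Malformed file: pikepdf (QPDF) rebuilds broken cross-references on open;
    # saving writes the repaired copy over the input. Log that repair was needed.
    with pikepdf.open(PATH, allow_overwriting_input=True) as damaged:
        damaged.save(PATH)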