# Table Extractor

Install from the TerminalSkills skills repository:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/table-extractor" ~/.claude/skills/terminalskills-skills-table-extractor && rm -rf "$T"
```

Note: this skill runs `pip install` and may use `sudo` (to install ghostscript).
## Overview
Extract tables from PDF documents with high accuracy using camelot-py. Handles complex table structures including merged cells, multi-line rows, spanning headers, and borderless tables. Outputs clean DataFrames that can be exported to CSV, Excel, or JSON.
## Instructions
When a user asks you to extract tables from a PDF, follow this process:
### Step 1: Install and verify dependencies

```shell
# Install camelot and its dependencies
pip install "camelot-py[base]" ghostscript opencv-python-headless pandas

# Verify ghostscript is available (required by camelot)
gs --version 2>/dev/null || echo "Install ghostscript: sudo apt install ghostscript"
```
If ghostscript is not available, fall back to pdfplumber:

```shell
pip install pdfplumber pandas
```
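A minimal sketch of the pdfplumber fallback (the filename and the first-row-as-header assumption are illustrative; pdfplumber is imported inside the function since it is only the fallback path):

```python
def extract_tables_pdfplumber(pdf_path):
    """Fallback extractor: returns one DataFrame per table found."""
    import pdfplumber  # Imported here because pdfplumber is the optional fallback
    import pandas as pd

    frames = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                # Treat the first extracted row as the header (adjust per document)
                frames.append(pd.DataFrame(rows[1:], columns=rows[0]))
    return frames
```

Call it as `extract_tables_pdfplumber("document.pdf")` and proceed with the same cleaning steps below.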
### Step 2: Inspect the PDF to locate tables

```python
import camelot

# Quick scan: how many tables are in the document?
tables = camelot.read_pdf("document.pdf", pages="all", flavor="lattice")
print(f"Found {len(tables)} tables using lattice detection")

# If no tables found, try stream detection (for borderless tables)
if len(tables) == 0:
    tables = camelot.read_pdf("document.pdf", pages="all", flavor="stream")
    print(f"Found {len(tables)} tables using stream detection")

# Summary of each table
for i, table in enumerate(tables):
    print(f"\nTable {i}: {table.shape[0]} rows x {table.shape[1]} cols (page {table.page})")
    print(f"Accuracy: {table.accuracy:.1f}%")
    print(table.df.head(3))
```
### Step 3: Choose the right extraction flavor

**Lattice flavor** (for tables with visible borders/gridlines):

```python
tables = camelot.read_pdf(
    "document.pdf",
    pages="1,2,3",            # Specific pages
    flavor="lattice",
    line_scale=40,            # Adjust line detection sensitivity
    process_background=True,  # Detect lines on colored backgrounds
)
```

**Stream flavor** (for borderless, whitespace-separated tables):

```python
tables = camelot.read_pdf(
    "document.pdf",
    pages="1",
    flavor="stream",
    edge_tol=50,                 # Tolerance for edge detection
    row_tol=10,                  # Tolerance for grouping text into rows
    columns=["72,200,350,500"],  # Manual column boundaries if auto-detection fails
)
```
### Step 4: Clean and process extracted tables

```python
import pandas as pd

for i, table in enumerate(tables):
    df = table.df

    # Promote first row to header if it contains column names
    if df.iloc[0].str.match(r'^[A-Za-z]').all():
        df.columns = df.iloc[0]
        df = df[1:].reset_index(drop=True)

    # Clean whitespace and newlines within cells
    df = df.apply(lambda col: col.str.strip().str.replace(r'\n', ' ', regex=True))

    # Remove completely empty rows
    df = df.dropna(how='all').replace('', pd.NA).dropna(how='all')

    # Convert numeric columns (regex=False so '$' is a literal, not an anchor)
    for col in df.columns:
        try:
            cleaned = df[col].str.replace(',', '', regex=False).str.replace('$', '', regex=False)
            df[col] = pd.to_numeric(cleaned)
        except (ValueError, TypeError, AttributeError):
            pass  # Keep as string

    print(f"\nCleaned Table {i}:")
    print(df.head())
```
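The numeric conversion above strips `$` and commas but not parenthesized negatives, which financial PDFs use for values like `(1,234)`. A standalone sketch of a stricter cleaner (pure Python; the function name is ours):

```python
def clean_numeric(cell):
    """Parse strings like '$1,234.50' or '(567)' into floats; return None if not numeric."""
    s = cell.strip().replace("$", "").replace(",", "")
    # Accounting notation: parentheses mean a negative value
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    try:
        value = float(s)
    except ValueError:
        return None
    return -value if negative else value
```

Applied per column with something like `df[col].map(clean_numeric)`, it leaves non-numeric cells as `None` for you to inspect.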
### Step 5: Handle complex table structures

**Merged cells and spanning headers:**

```python
# Forward-fill merged cells (common in row headers)
df.iloc[:, 0] = df.iloc[:, 0].replace('', pd.NA).ffill()

# Handle multi-level column headers
if df.iloc[0:2].apply(lambda x: x.str.len().mean()).mean() < 20:
    # Combine the first two rows into a single header
    new_cols = df.iloc[0] + " - " + df.iloc[1]
    df.columns = new_cols.str.strip(" - ")
    df = df[2:].reset_index(drop=True)
```
**Tables spanning multiple pages:**

```python
# Extract from all pages and concatenate
all_tables = camelot.read_pdf("document.pdf", pages="all", flavor="lattice")

# Group tables that are continuations (same column count)
groups = {}
for t in all_tables:
    key = t.shape[1]
    groups.setdefault(key, []).append(t.df)

for col_count, dfs in groups.items():
    combined = pd.concat(dfs, ignore_index=True)
    # Remove duplicate header rows that appear at page breaks
    combined = combined[~combined.duplicated(keep='first')]
```
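Note that `duplicated` above drops any repeated row, including legitimately identical data rows. If only the repeated headers should go, a stdlib sketch of the narrower check (operating on rows as lists, function name ours):

```python
def drop_repeated_headers(rows):
    """Keep the first row (the header) and drop later rows identical to it."""
    if not rows:
        return []
    header = rows[0]
    return [header] + [r for r in rows[1:] if r != header]
```

The same idea in pandas would compare each row against the header row rather than against all other rows.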
### Step 6: Export the results

```python
# CSV (one file per table)
for i, table in enumerate(tables):
    table.df.to_csv(f"table_{i+1}.csv", index=False)

# Excel (all tables as separate sheets)
with pd.ExcelWriter("extracted_tables.xlsx") as writer:
    for i, table in enumerate(tables):
        table.df.to_excel(writer, sheet_name=f"Table_{i+1}", index=False)

# JSON
for i, table in enumerate(tables):
    table.df.to_json(f"table_{i+1}.json", orient="records", indent=2)

print(f"Exported {len(tables)} tables")
```
## Examples

### Example 1: Extract financial tables from an annual report
User request: "Extract all tables from this annual report PDF"
Actions:
- Scan all pages with lattice flavor (financial reports typically have bordered tables)
- Identify income statement, balance sheet, and cash flow tables by column headers
- Clean numeric values (remove $, commas, parentheses for negatives)
- Export each table to a separate CSV and combine into one Excel workbook
Output: "Extracted 7 tables across 42 pages. Exported to extracted_tables.xlsx with sheets: Income_Statement, Balance_Sheet, Cash_Flow, Revenue_Breakdown, Expenses, Quarterly_Summary, KPIs."
### Example 2: Extract a specific table from a research paper
User request: "Get the results table from page 8 of this paper"
Actions:
- Target page 8 specifically: `camelot.read_pdf("paper.pdf", pages="8")`
- If multiple tables are on the page, show summaries and let the user pick
- Clean the extracted table and handle any multi-line cells
- Export as CSV
Output: A single CSV file with the results table, plus a preview of the first few rows printed to the console.
### Example 3: Batch process multiple PDFs
User request: "Extract the summary table from each of these 20 monthly reports"
Actions:

```python
import glob

import camelot
import pandas as pd

results = []
for pdf_path in sorted(glob.glob("reports/*.pdf")):
    tables = camelot.read_pdf(pdf_path, pages="1", flavor="lattice")
    if len(tables) > 0:
        df = tables[0].df  # First table on first page
        df["source_file"] = pdf_path
        results.append(df)

combined = pd.concat(results, ignore_index=True)
combined.to_csv("all_summaries.csv", index=False)
```
Output: A single CSV combining the summary table from all 20 reports with a source_file column for traceability.
## Guidelines

- Always try the `lattice` flavor first (bordered tables). Fall back to `stream` for borderless tables.
- Check the `accuracy` score on each table. Below 80% indicates extraction issues that need manual review.
- For scanned PDFs, run OCR first (e.g., `ocrmypdf`) before table extraction.
- When camelot struggles, try pdfplumber as an alternative: `page.extract_table(table_settings={...})`.
- Clean numeric data aggressively: remove currency symbols, commas, and handle parenthesized negatives.
- For tables with merged cells, use forward-fill on the appropriate columns.
- When extracting from multiple pages, watch for repeated header rows at page breaks.
- Always preview the extracted data before exporting to catch alignment or parsing issues.
- Report extraction quality metrics (accuracy, row/column count) so the user can verify correctness.
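The 80% accuracy guideline can be automated as a triage step. A sketch (threshold and function name are ours) that splits camelot `parsing_report` dicts into accepted tables and those needing manual review:

```python
def triage_by_accuracy(reports, threshold=80.0):
    """Split camelot parsing_report dicts by their 'accuracy' score."""
    accepted = [r for r in reports if r["accuracy"] >= threshold]
    review = [r for r in reports if r["accuracy"] < threshold]
    return accepted, review
```

Feed it `[t.parsing_report for t in tables]` and report both lists to the user alongside row/column counts.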