Skills-for-architects product-spec-pdf-parser
Extract structured FF&E product specs from PDF files — price books, fact sheets, and spec sheets. Claude reads extracted text and structures products into a standardized schedule.
git clone https://github.com/AlpacaLabsLLC/skills-for-architects
T=$(mktemp -d) && git clone --depth=1 https://github.com/AlpacaLabsLLC/skills-for-architects "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/06-materials-research/skills/product-spec-pdf-parser" ~/.claude/skills/alpacalabsllc-skills-for-architects-product-spec-pdf-parser && rm -rf "$T"
plugins/06-materials-research/skills/product-spec-pdf-parser/SKILL.md/product-spec-pdf-parser — PDF Product Spec Parser
Extract structured FF&E data from product PDF files — price books, fact sheets, configurator sheets, and spec sheets. Uses PyMuPDF for text extraction and Claude's reasoning to parse wildly varying PDF layouts into a standardized schedule.
Input
The user provides PDFs in one of these ways:
- File paths — one or more PDF file paths
- Folder path — a directory containing PDFs (will process all
files).pdf - Just invoked — ask the user for file paths or a folder
Also ask (or use defaults):
- Output destination — Google Sheet, local CSV, or markdown (default: ask)
- Variant depth —
(one row per variant/SKU, default) orexpand
(comma-separated variants in one row)summarize
Output Schema
Products are written to the master Google Sheet — the same 33-column schema used by all product skills, plus PDF-specific extra columns. When writing to CSV, use the same column order.
Read
../../schema/product-schema.md (relative to this SKILL.md) for the full column reference, field formats, and category vocabulary. Read ../../schema/sheet-conventions.md for CRUD patterns with MCP tools.
Skill-specific column values:
- AG (Source):
pdf-parser - AF (Status):
saved - J (Link): Blank (no URL for PDFs)
- D (Thumbnail): Blank (no image URL typically)
- C (Vendor): Blank (source is PDF, not a retailer)
- V (Sale Price): Blank (PDFs don't have sale prices)
- AC (Image URL): Blank (no image from PDF)
PDF-specific data in Notes (col AE)
PDFs contain fields that don't have dedicated master columns. Append these to Notes using
| as delimiter:
- Variant:
Variant: Diamond, Black - Price Adder:
Price adder: +$130 (PostureFit SL) - Country of Origin:
Origin: Sweden - Source File:
Source: alphabeta-fact-sheet.pdf
Example Notes cell:
Variant: Diamond, Black | Origin: Sweden | Source: alphabeta-fact-sheet.pdf
Variant Handling
Different PDF types require different approaches:
Fact sheets with SKUs (e.g., Alphabeta lamp)
- One row per SKU. Each shade shape × color = one row.
- Product Name stays the same across rows. Variant describes the distinguishing attributes.
- Example: "Alphabeta Floor Lamp" / Variant: "Diamond, Black" / SKU: "..."
Fact sheets with upholstery/finish combos (e.g., Puffy lounge chair)
- One row per upholstery option. Frame finish goes in Colors/Finishes.
- Distinct products (chair + ottoman) each get their own set of rows.
- Example: "Puffy Lounge Chair" / Variant: "Traffic Red" / Colors/Finishes: "Chrome frame"
Price books / configurators (e.g., Aeron price book)
- One row per distinct product type (e.g., Work Chair, Stool, Side Chair).
- Base configuration in main fields. Summarize configuration options — do NOT explode every permutation.
- Use Price Adder for incremental costs of add-ons or upgrades.
- Example: "Aeron Chair" / Variant: "Size B, Graphite" / List Price: 1395.00 / Price Adder: 130.00 (PostureFit SL)
expand
vs summarize
mode
expandsummarize- expand (default): One row per variant, SKU, or distinct option. Best for procurement and ordering.
- summarize: One row per product. Colors/Finishes and Variant are comma-separated lists. Best for quick reference.
Workflow
Step 1: Get input
Parse the user's input to identify PDF file(s) and output preferences.
- If given a folder, list all
files and report count.pdf - If no PDFs found or path is invalid, ask the user
- Confirm variant depth — default to
unless the user says otherwiseexpand - Report: "Found N PDF(s) to process."
Step 2: Extract text from PDF
Use PyMuPDF (fitz) to extract text from each PDF. Run this Python script via Bash:
import fitz import sys import json pdf_path = sys.argv[1] doc = fitz.open(pdf_path) pages = [] for i, page in enumerate(doc): text = page.get_text() pages.append({"page": i + 1, "text": text}) doc.close() print(json.dumps({"filename": pdf_path.split("/")[-1], "total_pages": len(pages), "pages": pages}))
For each PDF, extract all pages and save the JSON output.
Step 3: Parse products with Claude
Read the extracted text and identify all products, variants, and specifications. This is the core intelligence step — Claude reasons over the text to structure it.
For small PDFs (≤20 pages): Process all pages at once.
For large PDFs (>20 pages): Process in chunks of 10 pages at a time. After each chunk:
- Accumulate parsed products
- Carry forward context (product name, brand, any ongoing configuration table)
- At the end, deduplicate and merge
Parsing instructions:
- Identify the document type — fact sheet, price book, configurator, spec sheet, catalog
- Extract global fields first — brand, designer, collection, warranty, certifications, country of origin (these usually appear once)
- Find product boundaries — headings, page breaks, or new product names signal a new product
- For each product, extract all variants based on the variant handling rules above
- Map dimensions carefully — PDFs often format dimensions as "W × D × H" or in a spec table. Parse into separate W, D, H fields.
- Prices — distinguish between base price and adders. If a configurator shows "Base: $1,395 / Add: $130 for PostureFit", set List Price = 1395, Price Adder = 130
- Leave fields blank rather than guessing — if a field isn't in the PDF, leave it empty
Step 4: Present results
Show a summary markdown table with the parsed products. Include:
- Row count per PDF
- Any issues or assumptions made
- Sample of the first 10 rows if large
Ask: "Does this look correct? Should I adjust anything before saving?"
Step 5: Write output
Ask the user (if not already specified): "Where should I save this?"
Options:
- Master Google Sheet — append rows to the shared product library. Ask for spreadsheet ID if not already known.
- Local CSV — save to a specified path (default:
)./ffe-pdf-parse-YYYY-MM-DD.csv - Just the table — leave as markdown in the conversation
CSV Format
When saving to CSV, use the CSV header from
../../schema/product-schema.md.
Google Sheets Format
Append rows to the master Google Sheet using the same 33-column schema. Set
Clipped At to current timestamp and Source to pdf-parser. PDF-specific data (variant, price adder, country of origin, source filename) goes in the Notes column.
Edge Cases
- Scanned PDFs (image-only): PyMuPDF will return empty or garbage text. Detect this (very short text relative to page count) and tell the user: "This PDF appears to be scanned/image-based. Text extraction won't work — consider using an OCR tool first."
- Multi-language PDFs: Extract data as-is. Note the language. The cleanup skill handles translation.
- PDFs with tables as images: Common in price books. If a section seems to have missing data despite being a spec-heavy document, note it and flag for manual review.
- Password-protected PDFs: PyMuPDF will fail to open. Catch the error and tell the user.
- Very large PDFs (100+ pages): Process in 10-page chunks. Give progress updates every 20 pages.
- Mixed product types in one PDF: Handle each product type independently. A catalog with chairs AND tables gets rows for both.
Error Reporting
After processing, always report:
Parsed: X products from Y PDF(s) - filename.pdf: N products extracted - filename2.pdf: M products extracted Issues: [list any problems]