Claude-skill-registry financial-document-processor
Guidance for processing, classifying, and extracting data from financial documents (invoices, receipts, statements). This skill should be used when tasks involve OCR extraction, document classification, data validation from financial PDFs/images, or batch processing of financial documents. Covers safe file operations, incremental testing, and data extraction verification.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/financial-document-processor" ~/.claude/skills/majiayu000-claude-skill-registry-financial-document-processor && rm -rf "$T"
skills/data/financial-document-processor/SKILL.md
Financial Document Processor
Overview
This skill provides procedural guidance for processing financial documents such as invoices, receipts, and statements. It covers document classification, data extraction (amounts, VAT, dates), and batch processing workflows with emphasis on safe operations and verification.
Critical Principles
1. Never Perform Destructive Operations Without Backup
Before any file move, delete, or modification operation:
- Create explicit backups: cp -r /app/documents /app/documents_backup
- Verify backup exists before proceeding: ls -la /app/documents_backup
- Use copy-then-delete pattern instead of move when testing
- Never chain rm with mv in a single command without verification between steps
Dangerous pattern to avoid:
    # WRONG: Files deleted before move can execute
    rm -f /app/invoices/*.pdf && mv /app/other/* /app/documents/
Safe pattern:
    # CORRECT: Create backup first, verify, then operate
    cp -r /app/documents /app/documents_backup
    ls /app/documents_backup  # Verify backup
    # Now proceed with operations
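When the backup step is scripted rather than typed by hand, the same idea can be sketched in Python. This is a minimal illustration, not part of the skill itself: `backup_and_verify` and `count_files` are hypothetical helper names, and comparing file counts is an assumed (and fairly weak) notion of "verified".

```python
import shutil
from pathlib import Path


def count_files(root: Path) -> int:
    """Count regular files under root, recursively."""
    return sum(1 for p in root.rglob("*") if p.is_file())


def backup_and_verify(source: Path, backup: Path) -> int:
    """Copy source to backup and confirm the file counts match.

    shutil.copytree raises FileExistsError if the backup directory already
    exists, so an earlier backup is never silently overwritten.
    """
    shutil.copytree(source, backup)
    expected, actual = count_files(source), count_files(backup)
    if expected != actual:
        raise RuntimeError(f"Backup incomplete: {actual} of {expected} files copied")
    return actual
```

A call such as `backup_and_verify(Path("/app/documents"), Path("/app/documents_backup"))` would return the number of files copied, and any destructive step can then be gated on that call succeeding.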
2. Test Incrementally on Single Documents First
Before processing a batch of documents:
- Select one representative document from each category (invoice, receipt, statement)
- Run extraction on the single document
- Manually verify extracted values against the source document
- Only proceed to batch processing after single-document validation succeeds
3. Validate Before Declaring Success
Never mark a task complete without verification:
- Check that output files exist in expected locations
- Verify extracted data contains non-zero/non-empty values where expected
- Cross-reference a sample of extracted values against source documents
- If extraction produces mostly zeros or empty values, the extraction logic is failing
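The "mostly zeros or empty values" check can be automated with a small guard. The sketch below assumes each extraction result is a dict with a `total` field, and the 50% threshold is an illustrative default rather than a value from this skill.

```python
def empty_ratio(records, field="total"):
    """Return the fraction of records whose field is missing, empty, or zero."""
    if not records:
        return 1.0  # Nothing extracted at all counts as fully empty.
    empty = sum(1 for r in records if r.get(field) in (None, "", 0))
    return empty / len(records)


def extraction_looks_healthy(records, field="total", max_empty_ratio=0.5):
    """Return False when too many records have an empty or zero field."""
    return empty_ratio(records, field) <= max_empty_ratio
```

A False result suggests fixing the extraction logic before producing any summary, rather than declaring the task complete.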
Document Processing Workflow
Phase 1: Assessment
- Inventory documents: List all files to process with types and counts
- Sample inspection: Read/view 2-3 representative documents to understand format
- Identify challenges: Note format variations (European decimals, date formats, multi-page documents)
- Plan extraction strategy: Determine what tools/libraries are needed (OCR, PDF text extraction)
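The inventory step can be sketched with the standard library alone. The function name `inventory` is hypothetical, and grouping by file extension is only a rough proxy for document type.

```python
from collections import Counter
from pathlib import Path


def inventory(root):
    """Count files under root by lowercase extension."""
    return Counter(p.suffix.lower() for p in Path(root).rglob("*") if p.is_file())
```

A result with many image extensions (for example `.jpg` or `.png`) hints that OCR will be needed, while `.pdf` files may or may not contain extractable text.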
Phase 2: Implementation with Safety
- Create working backup: Always backup source documents before any processing
- Build extraction logic: Implement one extraction pattern at a time
- Test on single document: Validate each pattern before combining
- Handle edge cases explicitly: See references/extraction_patterns.md for common patterns
Phase 3: Batch Processing
- Process in small batches: Start with 5-10 documents, verify results
- Implement logging: Log each document processed with extraction results
- Flag low-confidence extractions: Mark documents where extraction may have failed
- Generate summary only after verification: Create summary.csv after confirming extraction quality
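One possible shape for small-batch processing with logging is sketched below. It assumes a caller-supplied `extract(path)` function (hypothetical) that returns a record or raises on failure. Results are yielded one batch at a time so that the caller can verify the first batch before continuing.

```python
import logging
from itertools import islice

logger = logging.getLogger("document_batch")


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def process_in_batches(paths, extract, batch_size=5):
    """Yield (results, flagged) for each batch of paths."""
    for number, batch in enumerate(batched(paths, batch_size), start=1):
        results, flagged = [], []
        for path in batch:
            try:
                record = extract(path)
                logger.info("extracted %s: %s", path, record)
                results.append(record)
            except Exception as exc:  # Keep going; flag the document instead.
                logger.warning("extraction failed for %s: %s", path, exc)
                flagged.append(path)
        logger.info("batch %d: %d ok, %d flagged", number, len(results), len(flagged))
        yield results, flagged
```

Because the function is a generator, no later batch runs until the caller asks for it, which leaves room for a manual check after the first 5-10 documents.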
Phase 4: Verification
- Spot-check results: Manually verify 10-20% of extracted values
- Check for systematic failures: Look for patterns in failed extractions
- Validate totals: If summing amounts, verify against expected totals
- Confirm file organization: Verify documents are in correct output directories
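Choosing which records to spot-check can be done with a random sample. This sketch assumes the extracted records are held in a list; `spot_check_sample` is a hypothetical helper, and the optional `seed` only makes the selection repeatable.

```python
import math
import random


def spot_check_sample(records, fraction=0.1, seed=None):
    """Pick roughly `fraction` of the records (at least one) for manual review."""
    if not records:
        return []
    size = min(len(records), max(1, math.ceil(len(records) * fraction)))
    return random.Random(seed).sample(records, size)
```

Sampling at random rather than taking the first few records reduces the chance of missing a systematic failure that only affects one document category.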
Common Extraction Challenges
European Number Formatting
Many European locales use a comma as the decimal separator and a period as the thousands separator (1.234,56 instead of 1,234.56):
    def parse_european_number(text):
        """Convert European format number to float."""
        # Remove thousand separators (periods)
        text = text.replace('.', '')
        # Convert decimal comma to period
        text = text.replace(',', '.')
        return float(text)
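When a batch mixes European and US formatting, a heuristic variant of the parser above may help. The sketch below is an assumption-laden illustration, not the skill's method: it treats the rightmost separator as the decimal separator and returns a `Decimal` instead of a float to avoid rounding surprises with money. The name `parse_amount` is hypothetical.

```python
from decimal import Decimal


def parse_amount(text):
    """Parse '1.234,56' or '1,234.56' into a Decimal.

    Heuristic: the rightmost separator is taken to be the decimal separator.
    Limitation: a value with a single separator, such as '1,234' or '1.234',
    is ambiguous and is read here as a decimal fraction.
    """
    cleaned = text.strip().replace(" ", "")
    if cleaned.rfind(",") > cleaned.rfind("."):
        # Comma is rightmost: periods are thousand separators.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # Period is rightmost, or there is no separator at all.
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)
```

Given the stated limitation, amounts with only one separator are reasonable candidates for manual review.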
VAT/Tax Amount Extraction
VAT may appear in multiple formats:
- "VAT: 20%", "VAT 20.00", "Tax: $20.00"
- May be absent entirely (set to 0 or empty string per requirements)
- May need calculation from gross/net amounts
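A rough sketch of handling the three label formats above follows. The regular expression is illustrative only (it expects a period as decimal separator and covers just the `VAT` and `Tax` labels), and `find_vat` and `vat_from_gross_and_net` are hypothetical names.

```python
import re
from decimal import Decimal

# Matches e.g. "VAT: 20%", "VAT 20.00", "Tax: $20.00".
VAT_PATTERN = re.compile(
    r"\b(?:VAT|Tax)\b\s*:?\s*[$€£]?\s*(\d+(?:\.\d+)?)\s*(%)?",
    re.IGNORECASE,
)


def find_vat(text):
    """Return ('rate', value) or ('amount', value), or None if no VAT line is found."""
    match = VAT_PATTERN.search(text)
    if match is None:
        return None
    kind = "rate" if match.group(2) else "amount"
    return kind, Decimal(match.group(1))


def vat_from_gross_and_net(gross, net):
    """Derive the VAT amount when only gross and net totals are printed."""
    return gross - net
```

Distinguishing a rate from an amount matters because "VAT: 20%" and "VAT 20.00" would otherwise both yield the number 20 with very different meanings.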
Total vs Amount Due
Invoices may have multiple "total" values:
- Subtotal (before tax)
- Total (after tax)
- Amount Due (final payable amount)
Prioritize "Amount Due" or final total when multiple values exist.
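The prioritization rule can be expressed as an ordered lookup. The sketch assumes the labelled amounts have already been extracted into a dict. Falling back to the subtotal when nothing else is present is an assumption added here, not something the skill prescribes.

```python
# Highest priority first.
TOTAL_PRIORITY = ("amount due", "total", "subtotal")


def pick_payable_amount(labelled_amounts):
    """Return (label, amount) for the highest-priority label present, else None."""
    normalized = {
        label.strip().lower(): amount for label, amount in labelled_amounts.items()
    }
    for label in TOTAL_PRIORITY:
        if label in normalized:
            return label, normalized[label]
    return None
```

For example, `pick_payable_amount({"Subtotal": 100.0, "Total": 120.0, "Amount Due": 70.0})` returns `("amount due", 70.0)`.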
OCR Quality Issues
For image-based documents:
- Implement confidence scoring when available
- Flag documents with low OCR confidence for manual review
- Consider pre-processing (contrast adjustment, deskewing) for poor quality scans
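If the OCR engine reports a confidence per recognized word, a document-level score can be as simple as the mean. The sketch below assumes confidences on a 0-100 scale and skips negative placeholder values, which some engines (Tesseract's tabular output, for instance) emit for rows that are not words. `mean_confidence` is a hypothetical helper.

```python
def mean_confidence(word_confidences):
    """Average per-word confidences (0-100), skipping negative placeholders."""
    valid = [float(c) for c in word_confidences if float(c) >= 0]
    if not valid:
        return 0.0  # No recognized words: treat as zero confidence.
    return sum(valid) / len(valid)
```

The resulting score can then be compared against whatever review threshold the task requires.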
Verification Checklist
Before declaring document processing complete, verify:
- All source documents accounted for (none missing)
- Documents classified to correct output directories
- Extracted amounts are non-zero for documents that should have amounts
- Date formats are consistent in output
- Summary CSV contains expected number of rows
- Spot-checked 10-20% of extractions against source documents
- No files were lost during processing (compare input/output counts)
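The count-based items in this checklist lend themselves to a script. The sketch below assumes the untouched backup directory serves as the source of truth, that every document is copied into exactly one output directory, and that the summary CSV has a header row. All function names are hypothetical.

```python
import csv
from pathlib import Path


def count_regular_files(root):
    """Count regular files under root, recursively."""
    return sum(1 for p in Path(root).rglob("*") if p.is_file())


def count_csv_rows(csv_path):
    """Count data rows in a CSV file, excluding the header row."""
    with open(csv_path, newline="") as handle:
        return sum(1 for _ in csv.DictReader(handle))


def check_counts(source_dir, output_dirs, summary_csv):
    """Return a list of problems; an empty list means all counts agree."""
    problems = []
    expected = count_regular_files(source_dir)
    classified = sum(count_regular_files(d) for d in output_dirs)
    rows = count_csv_rows(summary_csv)
    if classified != expected:
        problems.append(f"{classified} classified files, expected {expected}")
    if rows != expected:
        problems.append(f"{rows} summary rows, expected {expected}")
    return problems
```

Matching counts do not prove the extraction is correct, so the manual spot-check in the list above still applies.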
Resources
references/
- extraction_patterns.md: Regex patterns and extraction strategies for common financial document formats
When to Flag for Manual Review
Flag a document for manual review when:
- OCR confidence is below 80%
- Extracted total amount is 0 but document clearly shows amounts
- Multiple conflicting "total" values found
- Date cannot be parsed from document
- Document classification is ambiguous
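These criteria can be collected into one function that returns the reasons a document needs review. The record layout (`ocr_confidence`, `total`, `total_candidates`, `date`, `category`) is a hypothetical one chosen for illustration. Since code cannot tell whether a document "clearly shows amounts", this sketch flags every zero or missing total.

```python
def review_reasons(doc, min_confidence=80.0):
    """Return the reasons a document should be reviewed by hand (empty list if none)."""
    reasons = []
    confidence = doc.get("ocr_confidence")
    if confidence is not None and confidence < min_confidence:
        reasons.append("low OCR confidence")
    if not doc.get("total"):
        reasons.append("total is zero or missing")
    if len(set(doc.get("total_candidates", []))) > 1:
        reasons.append("conflicting totals")
    if doc.get("date") is None:
        reasons.append("unparseable date")
    if doc.get("category") is None:
        reasons.append("ambiguous classification")
    return reasons
```

Returning the reasons rather than a bare boolean makes the review queue easier to triage and helps reveal systematic failures.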