Claude-skill-registry financial-document-processor

Guidance for processing, classifying, and extracting data from financial documents (invoices, receipts, statements). This skill should be used when tasks involve OCR extraction, document classification, data validation from financial PDFs/images, or batch processing of financial documents. Covers safe file operations, incremental testing, and data extraction verification.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/financial-document-processor" ~/.claude/skills/majiayu000-claude-skill-registry-financial-document-processor && rm -rf "$T"

manifest: skills/data/financial-document-processor/SKILL.md

Financial Document Processor

Overview

This skill provides procedural guidance for processing financial documents such as invoices, receipts, and statements. It covers document classification, data extraction (amounts, VAT, dates), and batch processing workflows with emphasis on safe operations and verification.

Critical Principles

1. Never Perform Destructive Operations Without Backup

Before any file move, delete, or modification operation:

Create explicit backups:

cp -r /app/documents /app/documents_backup

Verify the backup exists before proceeding: `ls -la /app/documents_backup`
Use a copy-then-delete pattern instead of move when testing
Never chain `rm` with `mv` in a single command without verification between steps

Dangerous pattern to avoid:

# WRONG: files are deleted irreversibly, with no backup, before the move has been verified
rm -f /app/invoices/*.pdf && mv /app/other/* /app/documents/

Safe pattern:

# CORRECT: Create backup first, verify, then operate
cp -r /app/documents /app/documents_backup
ls /app/documents_backup  # Verify backup
# Now proceed with operations
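The same backup-then-verify step can be scripted so that verification is not skipped. The following is a minimal sketch, not part of the original skill: the helper names `count_files` and `backup_and_verify` are hypothetical, and comparing file counts is only one possible check (checksums would be stricter).

```python
import shutil
from pathlib import Path


def count_files(root: Path) -> int:
    """Count regular files under root, recursively."""
    return sum(1 for p in root.rglob("*") if p.is_file())


def backup_and_verify(source: Path, backup: Path) -> int:
    """Copy source to backup and confirm that the file counts match.

    Refuses to overwrite an existing backup so that an earlier good copy
    is never clobbered. Returns the number of files backed up.
    """
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.copytree(source, backup)
    src_count, dst_count = count_files(source), count_files(backup)
    if src_count != dst_count:
        raise RuntimeError(
            f"backup incomplete: {dst_count} of {src_count} files copied"
        )
    return src_count
```

A call such as `backup_and_verify(Path("/app/documents"), Path("/app/documents_backup"))` would mirror the shell commands above, with the paths taken from that example.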

2. Test Incrementally on Single Documents First

Before processing a batch of documents:

Select one representative document from each category (invoice, receipt, statement)
Run extraction on the single document
Manually verify extracted values against the source document
Only proceed to batch processing after single-document validation succeeds

3. Validate Before Declaring Success

Never mark a task complete without verification:

Check that output files exist in expected locations
Verify extracted data contains non-zero/non-empty values where expected
Cross-reference a sample of extracted values against source documents
If extraction produces mostly zeros or empty values, the extraction logic is failing
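One way to make the "mostly zeros or empty values" check concrete is to measure the share of records with an empty field. This is an illustrative sketch: the record layout (a list of dicts with a `total` key) and the 50% threshold are assumptions, not requirements stated by this skill.

```python
def empty_fraction(records, field="total"):
    """Return the fraction of records whose field is missing, empty, or zero."""
    if not records:
        # No output at all is treated as a complete failure.
        return 1.0
    empty = sum(1 for record in records if not record.get(field))
    return empty / len(records)


def extraction_looks_healthy(records, field="total", max_empty=0.5):
    """Return True when no more than max_empty of the records are empty."""
    return empty_fraction(records, field) <= max_empty
```

Values are assumed to be numbers or empty strings; a string such as `"0"` would count as non-empty here and would need normalizing first.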

Document Processing Workflow

Phase 1: Assessment

Inventory documents: List all files to process with types and counts
Sample inspection: Read/view 2-3 representative documents to understand format
Identify challenges: Note format variations (European decimals, date formats, multi-page documents)
Plan extraction strategy: Determine what tools/libraries are needed (OCR, PDF text extraction)

Phase 2: Implementation with Safety

Create working backup: Always backup source documents before any processing
Build extraction logic: Implement one extraction pattern at a time
Test on single document: Validate each pattern before combining
Handle edge cases explicitly: See `references/extraction_patterns.md` for common patterns

Phase 3: Batch Processing

Process in small batches: Start with 5-10 documents, verify results
Implement logging: Log each document processed with extraction results
Flag low-confidence extractions: Mark documents where extraction may have failed
Generate summary only after verification: Create summary.csv after confirming extraction quality
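The batch steps above could be sketched as follows. The `extract` callable stands in for whatever extraction logic was built in Phase 2, and the result layout (`path`, `data`, `needs_review`) is a hypothetical convention chosen for this example.

```python
import logging
from typing import Callable, Iterator, Sequence

logger = logging.getLogger("financial_batch")


def batches(items: Sequence[str], size: int = 5) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_batch(paths: Sequence[str], extract: Callable[[str], dict]) -> list[dict]:
    """Run `extract` on each path, log the outcome, and flag suspect results."""
    results = []
    for path in paths:
        try:
            data = extract(path)
            # An empty or zero total is treated as a likely extraction failure.
            needs_review = not data.get("total")
        except Exception:
            logger.exception("extraction failed for %s", path)
            data, needs_review = {}, True
        logger.info("processed %s: %s (needs_review=%s)", path, data, needs_review)
        results.append({"path": path, "data": data, "needs_review": needs_review})
    return results
```

Processing the first slice from `batches(...)`, inspecting the flagged entries, and only then continuing keeps a systematic failure from spreading across the whole document set.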

Phase 4: Verification

Spot-check results: Manually verify 10-20% of extracted values
Check for systematic failures: Look for patterns in failed extractions
Validate totals: If summing amounts, verify against expected totals
Confirm file organization: Verify documents are in correct output directories
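Choosing which documents to spot-check is easier to defend when the selection is random rather than hand-picked. A minimal sketch, assuming a 10% default fraction (the low end of the range above) and a hypothetical helper name:

```python
import math
import random


def spot_check_sample(items, fraction=0.1, seed=None):
    """Pick a random subset of items for manual verification.

    Always returns at least one item when the input is non-empty.
    """
    items = list(items)
    if not items:
        return []
    # round() guards against floating-point noise before taking the ceiling.
    k = max(1, math.ceil(round(len(items) * fraction, 9)))
    return random.Random(seed).sample(items, min(k, len(items)))
```

Passing a fixed `seed` makes the sample reproducible, which helps when the same documents need to be re-checked after a fix.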

Common Extraction Challenges

European Number Formatting

Many European locales use a comma as the decimal separator and a period as the thousands separator (1.234,56 instead of 1,234.56):

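def parse_european_number(text: str) -> float:
    """Convert a European-format number such as '1.234,56' to a float.

    Assumes the input really is European-formatted: a US-style value
    like '1,234.56' would be misread as 1.23456.
    """
    # Remove surrounding whitespace and thousands separators (periods)
    text = text.strip().replace('.', '')
    # Convert the decimal comma to a period
    text = text.replace(',', '.')
    return float(text)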
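`parse_european_number` only works when the format is already known. When a batch mixes formats, one common heuristic is to treat whichever separator appears last as the decimal separator. The sketch below is an assumption-laden illustration, not the skill's prescribed method: `parse_amount` is a hypothetical name, and it returns `Decimal` instead of `float` to avoid rounding error in currency values.

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse '1.234,56', '1,234.56', or '€ 12,50' into a Decimal.

    Heuristic: the separator that appears last is the decimal separator.
    Values with a single separator and three trailing digits ('1.234',
    '1,234') are ambiguous and are read as decimals, not thousands.
    """
    # Drop currency symbols, spaces, and anything else that is not numeric.
    cleaned = re.sub(r"[^\d.,-]", "", text)
    if cleaned.rfind(",") > cleaned.rfind("."):
        # European style: periods group thousands, comma marks decimals.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # US style (or no separator at all): commas group thousands.
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)
```

Because of the ambiguity noted in the docstring, values of that shape are good candidates for manual review.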

VAT/Tax Amount Extraction

VAT may appear in multiple formats:

"VAT: 20%", "VAT 20.00", "Tax: $20.00"
May be absent entirely (set to 0 or empty string per requirements)
May need calculation from gross/net amounts
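The three example formats above can be matched with a single regular expression. This is a sketch under stated assumptions: the pattern, the `extract_vat` and `vat_from_gross_net` names, and the `("rate" | "amount", value)` return shape are all hypothetical, and the pattern expects period decimals, so European-formatted text would need normalizing first.

```python
import re
from decimal import Decimal

# Matches "VAT: 20%", "VAT 20.00", and "Tax: $20.00" (case-insensitive).
VAT_PATTERN = re.compile(
    r"\b(?:VAT|Tax)\b\s*:?\s*[$€£]?\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<pct>%)?",
    re.IGNORECASE,
)


def extract_vat(text: str):
    """Return ('rate', value) or ('amount', value), or None when absent."""
    match = VAT_PATTERN.search(text)
    if match is None:
        return None
    kind = "rate" if match.group("pct") else "amount"
    return kind, Decimal(match.group("value"))


def vat_from_gross_net(gross: Decimal, net: Decimal) -> Decimal:
    """Derive the VAT amount when only gross and net totals are printed."""
    return gross - net
```

Distinguishing a rate from an amount matters because "VAT: 20%" and "VAT 20.00" carry the same digits but mean different things in the output.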

Total vs Amount Due

Invoices may have multiple "total" values:

Subtotal (before tax)
Total (after tax)
Amount Due (final payable amount)

Prioritize "Amount Due" or final total when multiple values exist.
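That priority rule can be written down explicitly. A minimal sketch, assuming the labelled values have already been extracted into a dict keyed by label text; the label strings and the function name are illustrative only.

```python
# Most authoritative label first, per the priority rule above.
TOTAL_PRIORITY = ("amount due", "total", "subtotal")


def pick_final_amount(labelled: dict):
    """Return (label, value) for the highest-priority label present, else None."""
    normalized = {label.strip().lower(): value for label, value in labelled.items()}
    for label in TOTAL_PRIORITY:
        if label in normalized:
            return label, normalized[label]
    return None
```

For example, `pick_final_amount({"Subtotal": 100.0, "Total": 120.0})` selects the `total` entry because no amount due is present.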

OCR Quality Issues

For image-based documents:

Implement confidence scoring when available
Flag documents with low OCR confidence for manual review
Consider pre-processing (contrast adjustment, deskewing) for poor quality scans

Verification Checklist

Before declaring document processing complete, verify:

All source documents accounted for (none missing)
Documents classified to correct output directories
Extracted amounts are non-zero for documents that should have amounts
Date formats are consistent in output
Summary CSV contains expected number of rows
Spot-checked 10-20% of extractions against source documents
No files were lost during processing (compare input/output counts)
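The count-based items in this checklist lend themselves to automation. The sketch below assumes one summary row per input document, a header row in the CSV, and a summary file that may live inside the output directory; `count_csv_rows` and `count_problems` are hypothetical helper names.

```python
import csv
from pathlib import Path


def count_csv_rows(csv_path: Path) -> int:
    """Count data rows in a CSV file, excluding the header line."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row if present
        return sum(1 for _ in reader)


def count_problems(input_dir: Path, output_dir: Path, summary_csv: Path) -> list[str]:
    """Compare input, output, and summary counts; return any mismatches."""
    n_in = sum(1 for p in input_dir.rglob("*") if p.is_file())
    # The summary file itself is not a processed document.
    n_out = sum(
        1 for p in output_dir.rglob("*") if p.is_file() and p != summary_csv
    )
    n_rows = count_csv_rows(summary_csv)
    problems = []
    if n_out != n_in:
        problems.append(f"{n_in} input files but {n_out} output files")
    if n_rows != n_in:
        problems.append(f"{n_in} input files but {n_rows} summary rows")
    return problems
```

An empty list means the counts agree; it does not confirm that the extracted values are correct, so the manual spot-check still applies.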

Resources

references/

`extraction_patterns.md`: Regex patterns and extraction strategies for common financial document formats

When to Flag for Manual Review

Flag a document for manual review when:

OCR confidence is below 80%
Extracted total amount is 0 but document clearly shows amounts
Multiple conflicting "total" values found
Date cannot be parsed from document
Document classification is ambiguous
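These criteria can be collected into one function that returns the reasons a document needs review. This is a sketch only: the field names (`ocr_confidence`, `total`, `total_candidates`, `date`, `category`) are hypothetical, and a zero total is flagged unconditionally because code cannot tell whether the document "clearly shows amounts".

```python
def review_reasons(doc: dict, min_confidence: float = 0.80) -> list[str]:
    """Return the reasons a document should be reviewed manually, if any."""
    reasons = []
    confidence = doc.get("ocr_confidence")
    if confidence is not None and confidence < min_confidence:
        reasons.append(f"OCR confidence {confidence:.0%} is below {min_confidence:.0%}")
    if not doc.get("total"):
        reasons.append("extracted total is zero or missing")
    if len(set(doc.get("total_candidates", []))) > 1:
        reasons.append("multiple conflicting total values found")
    if doc.get("date") is None:
        reasons.append("date could not be parsed")
    if doc.get("category") is None:
        reasons.append("document classification is ambiguous")
    return reasons
```

Returning the reasons instead of a bare boolean gives the reviewer a starting point and makes systematic failures easier to spot in the logs.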