Claude-Skills tabular-document-review

Install

Source · Clone the upstream repo:

git clone https://github.com/borghei/Claude-Skills

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/legal/tabular-document-review" ~/.claude/skills/borghei-claude-skills-tabular-document-review && rm -rf "$T"

Manifest: legal/tabular-document-review/SKILL.md

⚠️ EXPERIMENTAL — This skill is provided for educational and informational purposes only. It does NOT constitute legal advice. All responsibility for usage rests with the user. Consult qualified legal professionals before acting on any output.

Tabular Document Review Skill

Overview

Production-ready toolkit for extracting structured data from multiple legal documents into a comparison matrix with citations. Supports user-defined extraction columns, parallel processing with up to 10 agents, confidence scoring, and output in markdown table or structured JSON. Designed for legal teams performing bulk contract review, NDA comparison, employment agreement analysis, and lease review.

Tools

1. Document Discovery (scripts/document_discovery.py)

Scan a directory for legal documents and generate an inventory manifest.

python scripts/document_discovery.py /path/to/contracts

python scripts/document_discovery.py /path/to/ndas --types pdf,docx --json

python scripts/document_discovery.py /path/to/leases --types pdf,docx,txt,md --min-size 1024

2. Extraction Aggregator (scripts/extraction_aggregator.py)

Aggregate multiple extraction result JSONs into a unified comparison matrix.

python scripts/extraction_aggregator.py \
  --results extraction_1.json extraction_2.json extraction_3.json

python scripts/extraction_aggregator.py \
  --results-dir ./extraction_results/ --json

python scripts/extraction_aggregator.py \
  --results-dir ./extraction_results/ \
  --format markdown \
  --output review_matrix.md

python scripts/extraction_aggregator.py \
  --results extraction_1.json extraction_2.json \
  --columns "Parties,Effective Date,Term,Governing Law"
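Conceptually, the aggregation step merges per-document extraction records into a document × column matrix and flags conflicting duplicates for manual review. The sketch below illustrates that logic only; the field names (`document`, `column`, `value`) are assumptions, not the script's actual schema.

```python
# Illustrative sketch of the aggregation step -- NOT the script's actual code.
# Field names below are assumed for illustration.

def aggregate(records):
    """Merge extraction records into {document: {column: value}}; mark conflicts."""
    matrix = {}
    for rec in records:
        row = matrix.setdefault(rec["document"], {})
        col, val = rec["column"], rec["value"]
        if col in row and row[col] != val:
            # Conflicting values for the same cell are flagged, not silently overwritten.
            row[col] = f"CONFLICT: {row[col]} | {val}"
        else:
            row[col] = val
    return matrix

records = [
    {"document": "contract_a.pdf", "column": "Term", "value": "3 years [p.3]"},
    {"document": "contract_b.pdf", "column": "Term", "value": "2 years [p.4]"},
    {"document": "contract_a.pdf", "column": "Term", "value": "36 months [p.9]"},  # duplicate, conflicting
]
matrix = aggregate(records)
```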

Reference Guides

| Reference | Purpose |
| --- | --- |
| references/extraction_methodology.md | Document extraction best practices, JSON schema, agent prompts |
| references/common_extraction_columns.md | Pre-defined column sets for contracts, NDAs, employment, leases |

Workflows

5-Step Document Review Pipeline

| Step | Action | Tool | Output |
| --- | --- | --- | --- |
| 1. Gather Requirements | Define document folder, output filename, columns to extract | Manual | Column list, file path |
| 2. Discover Documents | Scan directory for target documents | document_discovery.py | Document manifest JSON |
| 3. Process Documents | Extract values per column with citations (parallel agents) | AI agents (external) | Per-document extraction JSONs |
| 4. Collect Results | Aggregate extraction JSONs into unified matrix | extraction_aggregator.py | Consolidated matrix |
| 5. Generate Output | Export as markdown table or structured JSON | extraction_aggregator.py | Final deliverable |

Parallel Processing Strategy

| Agents | Documents per Agent | Use When |
| --- | --- | --- |
| 1 | All | 1-5 documents |
| 2-3 | ceil(N/agents) | 6-15 documents |
| 4-6 | ceil(N/agents) | 16-40 documents |
| 7-10 | ceil(N/agents) | 41-100 documents |
| 10 (max) | ceil(N/10) | 100+ documents |
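The allocation implied by the table above can be sketched in a few lines. Note that the exact agent count within each band (e.g. 2 vs 3 for 6-15 documents) is a judgment call the skill leaves open; this sketch simply picks the top of each band.

```python
import math

def plan_agents(doc_count, max_agents=10):
    """Return (agent_count, documents_per_agent) per the strategy table.

    Band tops (3, 6, 10) are an assumption; the table allows any value in the band.
    """
    if doc_count <= 5:
        agents = 1
    elif doc_count <= 15:
        agents = 3
    elif doc_count <= 40:
        agents = 6
    else:
        agents = max_agents  # 41+ documents: use the maximum
    return agents, math.ceil(doc_count / agents)
```

For example, 25 documents would be split across 6 agents at 5 documents each, while 120 documents would use all 10 agents at 12 documents each.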

Agent Prompt Template

Each agent receives a prompt structured as:

You are reviewing {count} legal documents. For each document, extract the
following columns:

{column_definitions}

For each value extracted:
1. Provide the exact value found
2. Include the page number (PDF) or section/paragraph (DOCX/MD)
3. Rate your confidence: HIGH (exact match), MEDIUM (inferred), LOW (uncertain)
4. If not found, record "NOT FOUND" with confidence LOW

Output as JSON per the extraction schema.
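The per-document JSON an agent returns might look like the following. This is a hypothetical shape for illustration; the authoritative schema is defined in references/extraction_methodology.md, and the field names below are assumptions.

```python
import json

# Hypothetical extraction result for one document -- field names are
# illustrative, not the skill's actual schema (see extraction_methodology.md).
result = {
    "document": "contract_a.pdf",
    "extractions": [
        {"column": "Effective Date", "value": "2026-01-15",
         "citation": "p.2", "confidence": "HIGH"},
        {"column": "Liability Cap", "value": "NOT FOUND",
         "citation": None, "confidence": "LOW"},  # not found => LOW per the prompt
    ],
}
payload = json.dumps(result, indent=2)
```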

Confidence Scoring

| Level | Color Code | Definition |
| --- | --- | --- |
| HIGH | Green | Exact value found with clear citation |
| MEDIUM | Yellow | Value inferred from context; multiple possible interpretations |
| LOW | Red / Not Found | Value uncertain or not found in document |

Output Format

Sheet 1: Document Review

| Document | Parties | Effective Date | Term | Governing Law | ... |
| --- | --- | --- | --- | --- | --- |
| contract_a.pdf | Acme / Beta [p.1] | 2026-01-15 [p.2] | 3 years [p.3] | Delaware [p.12] | ... |
| contract_b.pdf | Gamma / Delta [p.1] | NOT FOUND | 2 years [p.4] | New York [p.10] | ... |

Sheet 2: Summary

| Metric | Value |
| --- | --- |
| Documents processed | 25 |
| Columns extracted | 8 |
| Average confidence | 87% |
| Not found rate | 12% |

Extraction Scenarios

Contract Review

| Column | What to Extract |
| --- | --- |
| Parties | All contracting parties with full legal names |
| Effective Date | Contract effective or execution date |
| Term | Duration of the agreement |
| Renewal | Auto-renewal terms and notice period |
| Governing Law | Jurisdiction governing the agreement |
| Liability Cap | Maximum liability amount or formula |
| Indemnification | Indemnification obligations and scope |
| IP Ownership | Intellectual property ownership provisions |
| Termination Rights | Termination triggers and notice requirements |
| Data Protection | Data protection or privacy obligations |

NDA Review

| Column | What to Extract |
| --- | --- |
| Parties | Disclosing and receiving parties |
| Type | Mutual or one-way |
| Definition Scope | How "confidential information" is defined |
| Exceptions | Standard exceptions to confidentiality |
| Term | Duration of confidentiality obligations |
| Survival | Survival period after termination |
| Return/Destruction | Obligations on termination |
| Remedies | Available remedies for breach |

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| Discovery finds 0 documents | Wrong path or file types | Verify path exists; check --types matches actual file extensions |
| Extraction JSONs have wrong schema | Agent prompt incomplete | Use the extraction schema from extraction_methodology.md |
| Aggregator shows conflicts | Multiple values for same cell | Review source documents; aggregator marks conflicts for manual review |
| High "NOT FOUND" rate | Columns too specific for document type | Use column definitions from common_extraction_columns.md; broaden definitions |
| Confidence all LOW | Agent unable to locate values | Check column definitions are specific enough; verify document is readable |
| Aggregator crashes on large set | Too many result files loaded at once | Process in batches of 50 results; use --columns to limit output width |
| Markdown table misaligned | Long values or special characters | Use --format json for machine processing; truncate long values |
| Missing citations | Agent did not include page/section references | Reinforce citation requirement in agent prompt; check extraction schema |

Success Criteria

  • Extraction Coverage: 90%+ of defined columns populated across all documents
  • Confidence Distribution: 70%+ of extractions rated HIGH confidence
  • Citation Accuracy: Every extracted value includes verifiable page/section citation
  • Processing Speed: 50+ documents processed within 30 minutes using parallel agents
  • Matrix Completeness: Final matrix includes all documents and all columns with no orphan rows
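The Extraction Coverage criterion above can be checked mechanically once the matrix is built. The sketch below assumes a `{document: {column: value}}` matrix shape (an assumption; the real aggregator output may differ) and treats "NOT FOUND" and empty cells as unpopulated.

```python
# Minimal check of the "Extraction Coverage" criterion: the fraction of
# document x column cells with a real value. Matrix shape is assumed.
def coverage(matrix, columns):
    total = populated = 0
    for row in matrix.values():
        for col in columns:
            total += 1
            if row.get(col) not in (None, "", "NOT FOUND"):
                populated += 1
    return populated / total if total else 0.0

matrix = {
    "contract_a.pdf": {"Term": "3 years [p.3]", "Parties": "Acme / Beta [p.1]"},
    "contract_b.pdf": {"Term": "2 years [p.4]", "Parties": "NOT FOUND"},
}
pct = coverage(matrix, ["Term", "Parties"])  # 3 of 4 cells populated
```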

Scope & Limitations

This skill covers:

  • Document inventory and discovery across PDF, DOCX, TXT, and MD formats
  • Aggregation of extraction results from parallel agent processing into unified matrix
  • Pre-defined column sets for contracts, NDAs, employment agreements, and leases
  • Confidence scoring and conflict detection for extracted values
  • Markdown and JSON output formats

This skill does NOT cover:

  • Actual document parsing or text extraction (requires external libraries or AI agents)
  • OCR processing for scanned documents
  • Excel/XLSX output generation (use JSON output and convert externally)
  • Automated legal analysis or risk assessment of extracted values
  • Document comparison or redlining between versions

Anti-Patterns

| Anti-Pattern | Why It Fails | Better Approach |
| --- | --- | --- |
| Vague column definitions | "Date" could match dozens of dates in a contract | Use specific definitions: "Effective Date" with guidance on where to look |
| Skipping document discovery | Unknown document count leads to wrong agent allocation | Always run discovery first; use manifest for pipeline planning |
| Ignoring LOW confidence results | Missing or uncertain data treated as fact | Review all LOW confidence cells manually; flag in final report |
| Processing 100+ docs with 1 agent | Slow, context window overflow, quality degradation | Use parallel processing: ceil(N/10) documents per agent, max 10 agents |
| No citation requirement | Cannot verify extracted values against source | Require page/section citation for every extraction; reject uncited values |

Tool Reference

scripts/document_discovery.py

Scan directory for legal documents and generate inventory manifest.

usage: document_discovery.py [-h] [--json]
                              [--types TYPES]
                              [--min-size MIN_SIZE]
                              [--max-size MAX_SIZE]
                              directory

positional arguments:
  directory             Path to directory containing documents

options:
  -h, --help            Show help message and exit
  --json                Output in JSON format
  --types TYPES         Comma-separated file extensions to include
                        (default: pdf,docx,doc,txt,md,rtf)
  --min-size MIN_SIZE   Minimum file size in bytes (default: 0)
  --max-size MAX_SIZE   Maximum file size in bytes (default: no limit)

scripts/extraction_aggregator.py

Aggregate extraction results into unified comparison matrix.

usage: extraction_aggregator.py [-h] [--json]
                                 [--results RESULTS [RESULTS ...]]
                                 [--results-dir RESULTS_DIR]
                                 [--format {markdown,json}]
                                 [--columns COLUMNS]
                                 [--output OUTPUT]

options:
  -h, --help            Show help message and exit
  --json                Output in JSON format (alias for --format json)
  --results             One or more extraction result JSON files
  --results-dir         Directory containing extraction result JSON files
  --format              Output format: markdown table or JSON (default: markdown)
  --columns             Comma-separated column names to include (default: all)
  --output              Write output to file instead of stdout