# Claude-skill-registry Document Parser

Parse large documents into structured sections with abstracts and metadata.

```bash
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/document-parser" ~/.claude/skills/majiayu000-claude-skill-registry-document-parser && rm -rf "$T"
```
## Overview

This skill provides tools and workflows for parsing large documents that exceed context limits. It extracts hierarchical structure, generates section abstracts, and extracts metadata using layout-aware hierarchical chunking principles optimized for RAG systems.

**Core principle:** Preserve semantic structure while chunking documents into 400-900 token sections with rich metadata for retrieval and comprehension.
## When to Use This Skill

Use this skill when:

- The document exceeds 25k tokens and can't fit in context
- The user explicitly requests document parsing or structure extraction
- You're building RAG systems that need semantically coherent chunks
- You're analyzing research papers, technical docs, or long-form content
- You need to extract tables, code blocks, benchmarks, or key terms
- You want progressive reading (abstracts first, then deep-dives)
- You're comparing multiple large documents

Don't use it for:

- ❌ Documents under 10k tokens (read directly instead)
- ❌ Binary file formats (PDFs, Word docs); convert to markdown first
- ❌ Simple text extraction (use grep/awk instead)
## Core Capabilities

The document-parser skill provides four main capabilities:

1. **Structure Analysis**
   - Extract markdown headers (H1-H6)
   - Build hierarchical section tree
   - Count tokens per section (target: 400-900)
   - Generate section maps for navigation

2. **Abstract Generation**
   - Create 100-200 token summaries for major sections
   - Preserve key concepts and relationships
   - Enable progressive reading workflows

3. **Metadata Extraction**
   - Extract tables with structure preservation
   - Capture code blocks with language tags
   - Identify benchmarks (percentages, metrics)
   - Extract key terms (techniques, models, acronyms)

4. **Output Generation**
   - Machine-readable JSON (structure.json, metadata.json)
   - Human-readable markdown (section_map.md)
   - Full section content with metadata
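Token counting underpins the 400-900 target. The scripts' exact tokenizer is an implementation detail, but tiktoken (listed in the dependencies under Common Mistakes) gives comparable counts. A minimal sketch, assuming the `cl100k_base` encoding:

```python
import tiktoken

# Encoding choice is an assumption; the bundled scripts may use another.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(section_text: str) -> int:
    """Token count used to check a section against the 400-900 target."""
    return len(enc.encode(section_text))

section = "## Background\n\nRetrieval-augmented generation (RAG) combines..."
n = count_tokens(section)
print(n, "within target" if 400 <= n <= 900 else "outside 400-900 range")
```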
## Quick Reference

| Task | Command | Output |
|---|---|---|
| Parse structure | `python3 scripts/parse_document_structure.py doc.md` | structure.json, section_map.md |
| Extract metadata | `python3 scripts/extract_metadata.py doc.md` | metadata.json |
| Custom output path | `--output FILEPATH` (either script) | Specify output file |
| Section map | `--map FILEPATH` (structure script) | Human-readable navigation |
## Chunking Principles Reference

The skill implements RAG-optimized chunking principles.

### The 400-900 Token Sweet Spot

- **Too small (<400):** fragments semantic meaning, loses context
- **Sweet spot (400-900):** complete thoughts, searchable, coherent
- **Too large (>900):** dilutes relevance, adds noise
### Layout-Aware Hierarchical Chunking

- Respect document structure (headers, sections)
- Never split mid-paragraph or mid-code-block
- Preserve parent-child relationships
- Include breadcrumb context (section path)
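A minimal sketch of these rules over plain markdown input: start a new chunk only at header lines, ignore header-like lines inside fenced code blocks, and carry a breadcrumb of parent titles. This is illustrative only; the bundled parse_document_structure.py implements its own version.

```python
import re

FENCE = "`" * 3  # three backticks, the fenced-code delimiter

def chunk_by_headers(markdown: str) -> list[dict]:
    """Layout-aware chunking: split at headers, never inside fenced
    code blocks, and record each chunk's breadcrumb path."""
    chunks, stack, buf, in_fence = [], [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle fenced-code state
        m = re.match(r"^(#{1,6})\s+(.+)", line)
        if m and not in_fence:  # a real header starts a new chunk
            if buf:
                chunks.append({"breadcrumb": " > ".join(stack),
                               "text": "\n".join(buf)})
            level, title = len(m.group(1)), m.group(2).strip()
            stack = stack[:level - 1] + [title]  # keep parent titles only
            buf = [line]
        else:
            buf.append(line)
    if buf:
        chunks.append({"breadcrumb": " > ".join(stack), "text": "\n".join(buf)})
    return chunks
```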
### Dual-Storage Pattern

- **Abstracts:** quick navigation, relevance filtering
- **Full sections:** deep-dive when needed
- **Metadata:** tables, benchmarks, key terms for targeted search
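A sketch of how a consumer might exploit this pattern: scan the cheap abstracts first, then deep-dive only into hits. Field names follow the structure.json format documented below; abstracts are present only when generated, and the keyword match here is a stand-in for a real relevance model.

```python
import json

def iter_sections(sections):
    """Depth-first walk of the hierarchical section tree."""
    for sec in sections:
        yield sec
        yield from iter_sections(sec.get("children", []))

def progressive_lookup(structure_path: str, query: str) -> list[dict]:
    """Return sections whose abstract mentions the query; read their
    full content from the source document afterwards."""
    with open(structure_path) as f:
        tree = json.load(f)["sections"]
    return [s for s in iter_sections(tree)
            if query.lower() in s.get("abstract", "").lower()]
```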
See `references/chunking_principles.md` for complete details.
## Sandbox Configuration

**IMPORTANT:** This skill requires executing Python scripts. In read-only sandbox mode, you need to do one of the following:

1. **Recommended:** configure a sandbox allowlist in `~/.codex/config.toml`:

   ```toml
   [sandbox]
   allowed_paths = ["~/.codex/skills/*/scripts"]
   ```

2. **Alternative:** pass `dangerouslyDisableSandbox: true` when calling the Bash tool.

See README.md in this skill directory for complete sandbox setup instructions.
## Implementation Workflows

### Workflow 1: Parse Single Large Document

**Use case:** a 47k token research paper.

```bash
# Step 1: Parse document structure
cd ~/.codex/skills/document-parser
python3 scripts/parse_document_structure.py /path/to/document.md \
  --output structure.json \
  --map section_map.md

# Step 2: Review section map
cat section_map.md  # Shows hierarchical outline with token counts

# Step 3: Extract metadata
python3 scripts/extract_metadata.py /path/to/document.md \
  --output metadata.json

# Step 4: Review extracted metadata
jq '.tables | length' metadata.json
jq '.benchmarks | length' metadata.json
jq '.key_terms | keys' metadata.json
```

**Expected output:**

- `structure.json`: hierarchical section tree with token counts
- `section_map.md`: human-readable outline for navigation
- `metadata.json`: tables, code blocks, benchmarks, key terms
### Workflow 2: Comparative Analysis

**Use case:** compare two research papers on similar topics.

```bash
# Parse both documents
for doc in paper1.md paper2.md; do
  python3 scripts/parse_document_structure.py "$doc" \
    --output "${doc%.md}_structure.json"
  python3 scripts/extract_metadata.py "$doc" \
    --output "${doc%.md}_metadata.json"
done

# Compare structures
diff -u \
  <(jq '.sections[] | .title' paper1_structure.json) \
  <(jq '.sections[] | .title' paper2_structure.json)

# Compare key terms
diff -u \
  <(jq '.key_terms.techniques[]' paper1_metadata.json | sort) \
  <(jq '.key_terms.techniques[]' paper2_metadata.json | sort)
```
### Workflow 3: Progressive Document Reading

**Use case:** understand a document before deep-diving.

```bash
# Step 1: Get high-level structure
python3 scripts/parse_document_structure.py document.md --map outline.md
cat outline.md  # Review: what are the main sections?

# Step 2: Read abstracts (if available in structure.json)
jq '.sections[] | select(.abstract) | {title, abstract}' structure.json

# Step 3: Extract metadata for context
python3 scripts/extract_metadata.py document.md --output metadata.json

# Step 4: Review key terms to understand the domain
jq '.key_terms' metadata.json

# Step 5: Deep-dive into specific sections
# Read full sections from the original document based on structure
```
## Script Documentation

### parse_document_structure.py

Extracts markdown headers, builds a hierarchical section tree, and counts tokens.

**Usage:**

```bash
python3 scripts/parse_document_structure.py <file.md> [OPTIONS]
```

**Options:**

- `--output FILEPATH`: output JSON file (default: structure.json)
- `--map FILEPATH`: output markdown section map (default: section_map.md)

**Output structure.json format:**

```json
{
  "sections": [
    {
      "id": "section-1",
      "title": "Introduction",
      "level": 1,
      "token_count": 450,
      "children": [
        {
          "id": "section-1.1",
          "title": "Background",
          "level": 2,
          "token_count": 320,
          "children": []
        }
      ]
    }
  ],
  "total_sections": 56,
  "total_tokens": 47000
}
```

**Output section_map.md format:**

```markdown
# Document Structure

- Introduction (450 tokens)
  - Background (320 tokens)
  - Motivation (280 tokens)
- Methods (650 tokens)
  - Data Collection (520 tokens)
  - Analysis (580 tokens)
```
### extract_metadata.py

Extracts tables, code blocks, benchmarks, and key terms.

**Usage:**

```bash
python3 scripts/extract_metadata.py <file.md> [OPTIONS]
```

**Options:**

- `--output FILEPATH`: output JSON file (default: metadata.json)

**Output metadata.json format:**

```json
{
  "tables": [
    {
      "id": "table-1",
      "section": "Results",
      "headers": ["Model", "Accuracy", "F1"],
      "rows": [
        ["GPT-4", "95.2%", "0.94"],
        ["Claude", "94.8%", "0.93"]
      ]
    }
  ],
  "code_blocks": [
    {
      "id": "code-1",
      "section": "Implementation",
      "language": "python",
      "content": "def parse_document(text):\n    ..."
    }
  ],
  "benchmarks": [
    {
      "metric": "Accuracy",
      "value": "95.2%",
      "context": "GPT-4 on MMLU benchmark"
    }
  ],
  "key_terms": {
    "techniques": ["RAG", "Fine-tuning", "Few-shot learning"],
    "models": ["GPT-4", "Claude", "Llama-2"],
    "acronyms": ["MMLU", "RAG", "NLP"]
  }
}
```
## Common Mistakes

### ❌ Sandbox permission errors when running scripts

**Problem:** "Permission denied", or scripts won't execute in read-only sandbox mode.

**Fix:** Configure a sandbox allowlist in `~/.codex/config.toml`:

```toml
[sandbox]
allowed_paths = ["~/.codex/skills/*/scripts"]
```

Or pass `dangerouslyDisableSandbox: true` when calling the Bash tool (development only). See README.md for complete setup instructions.
### ❌ Parsing non-markdown files

**Problem:** The scripts expect markdown input.

**Fix:** Convert Word docs to markdown with pandoc; for PDFs, run a PDF-to-text converter first, since pandoc cannot read PDF input.

```bash
pandoc document.docx -o document.md
```
### ❌ Ignoring token counts

**Problem:** Sections too large for embedding models.

**Fix:** Review the token counts in section_map.md and manually split sections over 900 tokens, for example at paragraph boundaries as sketched below.
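A minimal sketch of one manual splitting strategy (this helper is illustrative, not part of the bundled scripts; the encoding choice is an assumption):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption

def split_section(text: str, max_tokens: int = 900) -> list[str]:
    """Split an oversized section at paragraph boundaries so every
    piece stays under max_tokens; never cuts mid-paragraph."""
    pieces, current = [], []
    for para in text.split("\n\n"):
        candidate = "\n\n".join(current + [para])
        if current and len(enc.encode(candidate)) > max_tokens:
            pieces.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```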
### ❌ Missing Python dependencies

**Problem:** The scripts require specific libraries.

**Fix:** Install the dependencies:

```bash
pip install tiktoken markdown beautifulsoup4
```
### ❌ Not preserving structure

**Problem:** Flat extraction loses context.

**Fix:** Always use hierarchical parsing and maintain parent-child relationships, for example by carrying breadcrumb context as sketched below.
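One cheap way to keep that context is to prefix each chunk with its breadcrumb path before indexing. The bracketed-path convention here is an assumption, not the scripts' behavior:

```python
def with_breadcrumb(title: str, parents: list[str], text: str) -> str:
    """Prefix chunk text with its section path, e.g.
    '[Methods > Data Collection]', so retrieval hits keep their context."""
    return f"[{' > '.join(parents + [title])}]\n{text}"
```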
### ❌ Skipping metadata extraction

**Problem:** You lose valuable structured data.

**Fix:** Always run both scripts for a complete analysis.
## Examples

### Example 1: Research Paper (47k tokens)

**Input:** a 47k token research paper on RAG systems.

**Commands:**

```bash
python3 scripts/parse_document_structure.py rag_paper.md
python3 scripts/extract_metadata.py rag_paper.md
```

**Results:**

- 56 sections extracted
- 54 tables identified
- 145 benchmarks found
- 71 techniques cataloged
- Section map showing a 3-level hierarchy
- Average section size: 839 tokens (within the target range)
### Example 2: Technical Documentation

**Input:** API documentation with code examples.

**Commands:**

```bash
python3 scripts/parse_document_structure.py api_docs.md --map api_outline.md
python3 scripts/extract_metadata.py api_docs.md
```

**Use the results to:**

- Navigate the API structure via the outline
- Extract all code examples for testing (see the sketch below)
- Catalog all endpoints from tables
- Build a searchable knowledge base
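For extracting code examples, a minimal sketch that dumps every code block recorded in metadata.json to its own file. Field names follow the metadata.json format above; the output layout and extension map are assumptions:

```python
import json
import pathlib

with open("metadata.json") as f:
    metadata = json.load(f)

out = pathlib.Path("extracted_code")
out.mkdir(exist_ok=True)
for block in metadata.get("code_blocks", []):
    # Map language tags to file extensions; default to .txt.
    ext = {"python": "py", "bash": "sh"}.get(block["language"], "txt")
    (out / f"{block['id']}.{ext}").write_text(block["content"])
```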
### Example 3: Multi-Document Comparison

**Input:** 3 papers on LLM evaluation.

**Workflow:**

```bash
# Parse all documents into per-paper output files
for doc in paper*.md; do
  python3 scripts/parse_document_structure.py "$doc" \
    --output "${doc%.md}_structure.json"
  python3 scripts/extract_metadata.py "$doc" \
    --output "${doc%.md}_metadata.json"
done

# Compare methodologies
jq -r '.sections[] | select(.title | contains("Method")) | .title' *_structure.json

# Compare benchmarks
jq -r '.benchmarks[] | select(.metric == "Accuracy") | "\(.value) - \(.context)"' *_metadata.json
```
## Testing Your Parsing

After parsing a document, verify quality.

**Structure checklist:**

- All major sections captured
- Hierarchy preserved (H1 > H2 > H3)
- Token counts reasonable (400-900 target)
- Section map is human-readable
- JSON is valid (`jq . structure.json`)

**Metadata checklist:**

- Tables extracted with structure
- Code blocks include language tags
- Benchmarks capture value + context
- Key terms are domain-relevant
- JSON is valid (`jq . metadata.json`)
## Advanced Usage

### Custom Section Splitting

If sections are too large (>900 tokens), split them:

```bash
# In parse_document_structure.py, add a target_size parameter, then:
python3 scripts/parse_document_structure.py document.md \
  --target-size 600 \
  --max-size 900
```
### Filtering by Section Level

Extract only top-level sections:

```bash
jq '.sections[] | select(.level == 1)' structure.json
```
### Building a RAG Index

Use the parsed output for a RAG system:

```python
import json

# Load structure
with open('structure.json') as f:
    structure = json.load(f)

# Load metadata
with open('metadata.json') as f:
    metadata = json.load(f)

# Build embeddings for each section in the optimal chunk-size range.
# embed_and_index is a placeholder for your own embedding/indexing pipeline.
for section in structure['sections']:
    if 400 <= section['token_count'] <= 900:  # optimal chunk size
        embed_and_index(section)
```
## Integration with Other Skills

This skill complements:

- **skill-builder**: create new parsing strategies as skills
- **time-awareness**: track document parsing timestamps
## Proven Success

Tested successfully on a 47K token research document:

- ✅ 56 sections extracted
- ✅ 54 tables preserved
- ✅ 145 benchmarks identified
- ✅ 71 techniques cataloged
- ✅ Hierarchical section maps generated
- ✅ Metadata JSON validated
## References

- `references/chunking_principles.md`: complete RAG chunking methodology
- Scripts in the `scripts/` directory
- See skill-builder for creating document-specific parsing skills

**Remember:** Large documents are structured data. Parse the structure first, then read strategically.