# Claude-skill-registry Document Parser

Parse large documents into structured sections with abstracts and metadata.

```bash
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/document-parser" ~/.claude/skills/majiayu000-claude-skill-registry-document-parser && rm -rf "$T"
```
## Overview

This skill provides tools and workflows for parsing large documents that exceed context limits. It extracts hierarchical structure, generates section abstracts, and extracts metadata using layout-aware hierarchical chunking principles optimized for RAG systems.

**Core principle:** Preserve semantic structure while chunking documents into 400-900 token sections with rich metadata for retrieval and comprehension.
## When to Use This Skill

Use this skill when:

- The document exceeds 25k tokens and can't fit in context
- The user explicitly requests document parsing or structure extraction
- You're building RAG systems that need semantically coherent chunks
- You're analyzing research papers, technical docs, or long-form content
- You need to extract tables, code blocks, benchmarks, or key terms
- You want progressive reading (abstracts first, then deep-dives)
- You're comparing multiple large documents

Don't use it for:

- ❌ Documents under 10k tokens (read directly instead)
- ❌ Binary file formats (PDFs, Word docs); convert to markdown first
- ❌ Simple text extraction (use grep/awk instead)
## Core Capabilities

The document-parser skill provides four main capabilities:

1. **Structure Analysis**
   - Extract markdown headers (H1-H6)
   - Build hierarchical section tree
   - Count tokens per section (target: 400-900)
   - Generate section maps for navigation

2. **Abstract Generation**
   - Create 100-200 token summaries for major sections
   - Preserve key concepts and relationships
   - Enable progressive reading workflows

3. **Metadata Extraction**
   - Extract tables with structure preservation
   - Capture code blocks with language tags
   - Identify benchmarks (percentages, metrics)
   - Extract key terms (techniques, models, acronyms)

4. **Output Generation**
   - Machine-readable JSON (structure.json, metadata.json)
   - Human-readable markdown (section_map.md)
   - Full section content with metadata
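Token counting underpins the 400-900 target. The scripts' exact tokenizer is an implementation detail, but tiktoken (listed in the dependencies under Common Mistakes) gives comparable counts. A minimal sketch, assuming the `cl100k_base` encoding:

```python
import tiktoken

# Encoding choice is an assumption; the bundled scripts may use another.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(section_text: str) -> int:
    """Token count used to check a section against the 400-900 target."""
    return len(enc.encode(section_text))

section = "## Background\n\nRetrieval-augmented generation (RAG) combines..."
n = count_tokens(section)
print(n, "within target" if 400 <= n <= 900 else "outside 400-900 range")
```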
## Quick Reference

| Task | Command | Output |
|---|---|---|
| Parse structure | `python3 scripts/parse_document_structure.py doc.md` | structure.json, section_map.md |
| Extract metadata | `python3 scripts/extract_metadata.py doc.md` | metadata.json |
| Custom output path | `--output FILEPATH` (either script) | Specify output file |
| Section map | `--map FILEPATH` (structure script) | Human-readable navigation |
## Chunking Principles Reference

The skill implements RAG-optimized chunking principles.

### The 400-900 Token Sweet Spot

- **Too small (<400):** fragments semantic meaning, loses context
- **Sweet spot (400-900):** complete thoughts, searchable, coherent
- **Too large (>900):** dilutes relevance, adds noise
### Layout-Aware Hierarchical Chunking

- Respect document structure (headers, sections)
- Never split mid-paragraph or mid-code-block
- Preserve parent-child relationships
- Include breadcrumb context (section path)
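A minimal sketch of these rules over plain markdown input: start a new chunk only at header lines, ignore header-like lines inside fenced code blocks, and carry a breadcrumb of parent titles. This is illustrative only; the bundled parse_document_structure.py implements its own version.

```python
import re

FENCE = "`" * 3  # three backticks, the fenced-code delimiter

def chunk_by_headers(markdown: str) -> list[dict]:
    """Layout-aware chunking: split at headers, never inside fenced
    code blocks, and record each chunk's breadcrumb path."""
    chunks, stack, buf, in_fence = [], [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle fenced-code state
        m = re.match(r"^(#{1,6})\s+(.+)", line)
        if m and not in_fence:  # a real header starts a new chunk
            if buf:
                chunks.append({"breadcrumb": " > ".join(stack),
                               "text": "\n".join(buf)})
            level, title = len(m.group(1)), m.group(2).strip()
            stack = stack[:level - 1] + [title]  # keep parent titles only
            buf = [line]
        else:
            buf.append(line)
    if buf:
        chunks.append({"breadcrumb": " > ".join(stack), "text": "\n".join(buf)})
    return chunks
```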
### Dual-Storage Pattern

- **Abstracts:** quick navigation, relevance filtering
- **Full sections:** deep-dive when needed
- **Metadata:** tables, benchmarks, key terms for targeted search
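A sketch of how a consumer might exploit this pattern: scan the cheap abstracts first, then deep-dive only into hits. Field names follow the structure.json format documented below; abstracts are present only when generated, and the keyword match here is a stand-in for a real relevance model.

```python
import json

def iter_sections(sections):
    """Depth-first walk of the hierarchical section tree."""
    for sec in sections:
        yield sec
        yield from iter_sections(sec.get("children", []))

def progressive_lookup(structure_path: str, query: str) -> list[dict]:
    """Return sections whose abstract mentions the query; read their
    full content from the source document afterwards."""
    with open(structure_path) as f:
        tree = json.load(f)["sections"]
    return [s for s in iter_sections(tree)
            if query.lower() in s.get("abstract", "").lower()]
```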
See `references/chunking_principles.md` for complete details.
## Sandbox Configuration

**IMPORTANT:** This skill requires executing Python scripts. In read-only sandbox mode, you need to do one of the following:

1. **Recommended:** configure a sandbox allowlist in `~/.codex/config.toml`:

   ```toml
   [sandbox]
   allowed_paths = ["~/.codex/skills/*/scripts"]
   ```

2. **Alternative:** pass `dangerouslyDisableSandbox: true` when calling the Bash tool.

See README.md in this skill directory for complete sandbox setup instructions.
## Implementation Workflows

### Workflow 1: Parse Single Large Document

**Use case:** a 47k token research paper.

```bash
# Step 1: Parse document structure
cd ~/.codex/skills/document-parser
python3 scripts/parse_document_structure.py /path/to/document.md \
  --output structure.json \
  --map section_map.md

# Step 2: Review section map
cat section_map.md  # Shows hierarchical outline with token counts

# Step 3: Extract metadata
python3 scripts/extract_metadata.py /path/to/document.md \
  --output metadata.json

# Step 4: Review extracted metadata
jq '.tables | length' metadata.json
jq '.benchmarks | length' metadata.json
jq '.key_terms | keys' metadata.json
```

**Expected output:**

- `structure.json`: hierarchical section tree with token counts
- `section_map.md`: human-readable outline for navigation
- `metadata.json`: tables, code blocks, benchmarks, key terms
### Workflow 2: Comparative Analysis

**Use case:** compare two research papers on similar topics.

```bash
# Parse both documents
for doc in paper1.md paper2.md; do
  python3 scripts/parse_document_structure.py "$doc" \
    --output "${doc%.md}_structure.json"
  python3 scripts/extract_metadata.py "$doc" \
    --output "${doc%.md}_metadata.json"
done

# Compare structures
diff -u \
  <(jq '.sections[] | .title' paper1_structure.json) \
  <(jq '.sections[] | .title' paper2_structure.json)

# Compare key terms
diff -u \
  <(jq '.key_terms.techniques[]' paper1_metadata.json | sort) \
  <(jq '.key_terms.techniques[]' paper2_metadata.json | sort)
```
### Workflow 3: Progressive Document Reading

**Use case:** understand a document before deep-diving.

```bash
# Step 1: Get high-level structure
python3 scripts/parse_document_structure.py document.md --map outline.md
cat outline.md  # Review: what are the main sections?

# Step 2: Read abstracts (if available in structure.json)
jq '.sections[] | select(.abstract) | {title, abstract}' structure.json

# Step 3: Extract metadata for context
python3 scripts/extract_metadata.py document.md --output metadata.json

# Step 4: Review key terms to understand the domain
jq '.key_terms' metadata.json

# Step 5: Deep-dive into specific sections
# Read full sections from the original document based on structure
```
## Script Documentation

### parse_document_structure.py

Extracts markdown headers, builds a hierarchical section tree, and counts tokens.

**Usage:**

```bash
python3 scripts/parse_document_structure.py <file.md> [OPTIONS]
```

**Options:**

- `--output FILEPATH`: output JSON file (default: structure.json)
- `--map FILEPATH`: output markdown section map (default: section_map.md)

**Output structure.json format:**

```json
{
  "sections": [
    {
      "id": "section-1",
      "title": "Introduction",
      "level": 1,
      "token_count": 450,
      "children": [
        {
          "id": "section-1.1",
          "title": "Background",
          "level": 2,
          "token_count": 320,
          "children": []
        }
      ]
    }
  ],
  "total_sections": 56,
  "total_tokens": 47000
}
```

**Output section_map.md format:**

```markdown
# Document Structure

- Introduction (450 tokens)
  - Background (320 tokens)
  - Motivation (280 tokens)
- Methods (650 tokens)
  - Data Collection (520 tokens)
  - Analysis (580 tokens)
```
### extract_metadata.py

Extracts tables, code blocks, benchmarks, and key terms.

**Usage:**

```bash
python3 scripts/extract_metadata.py <file.md> [OPTIONS]
```

**Options:**

- `--output FILEPATH`: output JSON file (default: metadata.json)

**Output metadata.json format:**

```json
{
  "tables": [
    {
      "id": "table-1",
      "section": "Results",
      "headers": ["Model", "Accuracy", "F1"],
      "rows": [
        ["GPT-4", "95.2%", "0.94"],
        ["Claude", "94.8%", "0.93"]
      ]
    }
  ],
  "code_blocks": [
    {
      "id": "code-1",
      "section": "Implementation",
      "language": "python",
      "content": "def parse_document(text):\n    ..."
    }
  ],
  "benchmarks": [
    {
      "metric": "Accuracy",
      "value": "95.2%",
      "context": "GPT-4 on MMLU benchmark"
    }
  ],
  "key_terms": {
    "techniques": ["RAG", "Fine-tuning", "Few-shot learning"],
    "models": ["GPT-4", "Claude", "Llama-2"],
    "acronyms": ["MMLU", "RAG", "NLP"]
  }
}
```
## Common Mistakes

### ❌ Sandbox permission errors when running scripts

**Problem:** "Permission denied", or scripts won't execute in read-only sandbox mode.

**Fix:** Configure a sandbox allowlist in `~/.codex/config.toml`:

```toml
[sandbox]
allowed_paths = ["~/.codex/skills/*/scripts"]
```

Or pass `dangerouslyDisableSandbox: true` when calling the Bash tool (development only). See README.md for complete setup instructions.
### ❌ Parsing non-markdown files

**Problem:** The scripts expect markdown input.

**Fix:** Convert Word docs to markdown with pandoc; for PDFs, run a PDF-to-text converter first, since pandoc cannot read PDF input.

```bash
pandoc document.docx -o document.md
```
### ❌ Ignoring token counts

**Problem:** Sections too large for embedding models.

**Fix:** Review the token counts in section_map.md and manually split sections over 900 tokens, for example at paragraph boundaries as sketched below.
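A minimal sketch of one manual splitting strategy (this helper is illustrative, not part of the bundled scripts; the encoding choice is an assumption):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption

def split_section(text: str, max_tokens: int = 900) -> list[str]:
    """Split an oversized section at paragraph boundaries so every
    piece stays under max_tokens; never cuts mid-paragraph."""
    pieces, current = [], []
    for para in text.split("\n\n"):
        candidate = "\n\n".join(current + [para])
        if current and len(enc.encode(candidate)) > max_tokens:
            pieces.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```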
### ❌ Missing Python dependencies

**Problem:** The scripts require specific libraries.

**Fix:** Install the dependencies:

```bash
pip install tiktoken markdown beautifulsoup4
```
### ❌ Not preserving structure

**Problem:** Flat extraction loses context.

**Fix:** Always use hierarchical parsing and maintain parent-child relationships, for example by carrying breadcrumb context as sketched below.
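One cheap way to keep that context is to prefix each chunk with its breadcrumb path before indexing. The bracketed-path convention here is an assumption, not the scripts' behavior:

```python
def with_breadcrumb(title: str, parents: list[str], text: str) -> str:
    """Prefix chunk text with its section path, e.g.
    '[Methods > Data Collection]', so retrieval hits keep their context."""
    return f"[{' > '.join(parents + [title])}]\n{text}"
```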
### ❌ Skipping metadata extraction

**Problem:** You lose valuable structured data.

**Fix:** Always run both scripts for a complete analysis.
## Examples

### Example 1: Research Paper (47k tokens)

**Input:** a 47k token research paper on RAG systems.

**Commands:**

```bash
python3 scripts/parse_document_structure.py rag_paper.md
python3 scripts/extract_metadata.py rag_paper.md
```

**Results:**

- 56 sections extracted
- 54 tables identified
- 145 benchmarks found
- 71 techniques cataloged
- Section map showing a 3-level hierarchy
- Average section size: 839 tokens (within the target range)
### Example 2: Technical Documentation

**Input:** API documentation with code examples.

**Commands:**

```bash
python3 scripts/parse_document_structure.py api_docs.md --map api_outline.md
python3 scripts/extract_metadata.py api_docs.md
```

**Use the results to:**

- Navigate the API structure via the outline
- Extract all code examples for testing (see the sketch below)
- Catalog all endpoints from tables
- Build a searchable knowledge base
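For extracting code examples, a minimal sketch that dumps every code block recorded in metadata.json to its own file. Field names follow the metadata.json format above; the output layout and extension map are assumptions:

```python
import json
import pathlib

with open("metadata.json") as f:
    metadata = json.load(f)

out = pathlib.Path("extracted_code")
out.mkdir(exist_ok=True)
for block in metadata.get("code_blocks", []):
    # Map language tags to file extensions; default to .txt.
    ext = {"python": "py", "bash": "sh"}.get(block["language"], "txt")
    (out / f"{block['id']}.{ext}").write_text(block["content"])
```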
### Example 3: Multi-Document Comparison

**Input:** 3 papers on LLM evaluation.

**Workflow:**

```bash
# Parse all documents into per-paper output files
for doc in paper*.md; do
  python3 scripts/parse_document_structure.py "$doc" \
    --output "${doc%.md}_structure.json"
  python3 scripts/extract_metadata.py "$doc" \
    --output "${doc%.md}_metadata.json"
done

# Compare methodologies
jq -r '.sections[] | select(.title | contains("Method")) | .title' *_structure.json

# Compare benchmarks
jq -r '.benchmarks[] | select(.metric == "Accuracy") | "\(.value) - \(.context)"' *_metadata.json
```
## Testing Your Parsing

After parsing a document, verify quality.

**Structure checklist:**

- All major sections captured
- Hierarchy preserved (H1 > H2 > H3)
- Token counts reasonable (400-900 target)
- Section map is human-readable
- JSON is valid (`jq . structure.json`)

**Metadata checklist:**

- Tables extracted with structure
- Code blocks include language tags
- Benchmarks capture value + context
- Key terms are domain-relevant
- JSON is valid (`jq . metadata.json`)
## Advanced Usage

### Custom Section Splitting

If sections are too large (>900 tokens), split them:

```bash
# In parse_document_structure.py, add a target_size parameter, then:
python3 scripts/parse_document_structure.py document.md \
  --target-size 600 \
  --max-size 900
```
### Filtering by Section Level

Extract only top-level sections:

```bash
jq '.sections[] | select(.level == 1)' structure.json
```
### Building a RAG Index

Use the parsed output for a RAG system:

```python
import json

# Load structure
with open('structure.json') as f:
    structure = json.load(f)

# Load metadata
with open('metadata.json') as f:
    metadata = json.load(f)

# Build embeddings for each section in the optimal chunk-size range.
# embed_and_index is a placeholder for your own embedding/indexing pipeline.
for section in structure['sections']:
    if 400 <= section['token_count'] <= 900:  # optimal chunk size
        embed_and_index(section)
```
## Integration with Other Skills

This skill complements:

- **skill-builder**: create new parsing strategies as skills
- **time-awareness**: track document parsing timestamps
## Proven Success

Tested successfully on a 47K token research document:

- ✅ 56 sections extracted
- ✅ 54 tables preserved
- ✅ 145 benchmarks identified
- ✅ 71 techniques cataloged
- ✅ Hierarchical section maps generated
- ✅ Metadata JSON validated
## References

- `references/chunking_principles.md`: complete RAG chunking methodology
- Scripts in the `scripts/` directory
- See skill-builder for creating document-specific parsing skills

**Remember:** Large documents are structured data. Parse the structure first, then read strategically.