Claude-skill-registry large-document-processing
Process large documents (200+ pages) with structure preservation, intelligent parsing, and memory-efficient handling. Use when working with complex formatted documents, multi-level hierarchies, or when you need to extract structured data from large files like PDFs, DOCX, or text files.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/large-document-processing" ~/.claude/skills/majiayu000-claude-skill-registry-large-document-processing && rm -rf "$T"
manifest:
skills/data/large-document-processing/SKILL.md · source content
Large Document Processing
Overview
A comprehensive skill for processing large documents (200+ pages) with structure preservation, intelligent parsing, and memory-efficient handling. Designed for documents with complex formatting, hierarchical structures, and multi-level indentation.
Capabilities
- Multi-format Support: DOCX, PDF, and text files
- Structure Preservation: Maintains document hierarchy, indentation, and formatting
- Memory Efficiency: Chunked processing to handle very large documents
- Intelligent Parsing: Recognizes headings, lists, dictionary entries, and semantic boundaries
- Progress Tracking: Real-time processing status and error recovery
- Metadata Extraction: Comprehensive document analysis and statistics
Core Components
1. Advanced Document Parser
Parse complex document structures while preserving formatting and hierarchy.
Key Features:
- Hierarchical structure detection (levels 1-10)
- Formatting preservation (bold, italic, fonts, sizes)
- Page-by-page processing for memory efficiency
- Intelligent content classification
- Multi-language support with accent character handling
2. Implementation Pattern
```python
from .large_document_processor import LargeDocumentProcessor, ProcessingConfig

# Configure processing
config = ProcessingConfig(
    chunk_size_pages=50,
    parallel_workers=4,
    preserve_formatting=True,
)

# Initialize processor
processor = LargeDocumentProcessor(config)

# Process document
results = processor.process_large_document(
    input_file="large_document.docx",
    output_dir="output/processed",
)
```
3. Intelligent Text Chunking
```python
from .intelligent_chunker import IntelligentTextChunker, ChunkType

chunker = IntelligentTextChunker(
    max_chunk_size=1024,
    overlap_ratio=0.15,
    preserve_sentences=True,
)

chunks = chunker.chunk_document(text, ChunkType.SEMANTIC)
```
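As a rough illustration of sentence-preserving chunking with overlap, the `chunk_text` function below is a simplified stand-in for the `IntelligentTextChunker` API, not its actual implementation: it breaks only at sentence boundaries and carries roughly `overlap_ratio` of each chunk's tail into the next chunk.

```python
import re

def chunk_text(text: str, max_chunk_size: int = 1024, overlap_ratio: float = 0.15):
    """Split text into chunks of about max_chunk_size characters,
    breaking only at sentence boundaries, with overlapping tails."""
    # Naive sentence split on ., !, ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, size = [], [], 0
    for sent in sentences:
        if current and size + len(sent) > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as overlap for the next chunk
            overlap_chars = int(max_chunk_size * overlap_ratio)
            tail, tail_len = [], 0
            for s in reversed(current):
                if tail_len + len(s) > overlap_chars:
                    break
                tail.insert(0, s)
                tail_len += len(s)
            current, size = tail, tail_len
        current.append(sent)
        size += len(sent) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlap matters for AI-ready segments because a fact split across a chunk boundary is otherwise invisible to both chunks.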
Output Formats
- Structured JSON: Complete document hierarchy and metadata
- Plain text: Clean extracted text with optional formatting markers
- Chunked data: AI-ready text segments with overlap and metadata
- Statistics report: Processing metrics and quality analysis
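To make the structured JSON output concrete, the shape below is an assumed illustration, not the skill's documented schema: nested nodes mirror the document hierarchy, with document-level metadata alongside.

```python
import json

# Hypothetical shape of the "Structured JSON" output
document = {
    "metadata": {"source": "large_document.docx", "pages": 312, "language": "en"},
    "structure": [
        {
            "type": "heading",
            "level": 1,
            "text": "Introduction",
            "children": [
                {"type": "paragraph", "text": "Opening paragraph.", "page": 1},
                {"type": "list", "items": ["first item", "second item"], "page": 2},
            ],
        }
    ],
}

print(json.dumps(document, indent=2))
```

Keeping `page` on each leaf node lets downstream consumers cite back into the original document.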
Best Practices
- Memory Management: Use chunked processing for documents >100MB
- Parallel Processing: Leverage multiple workers for batch operations
- Structure Validation: Verify hierarchy detection accuracy
- Progress Tracking: Provide user feedback for long-running operations
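The memory-management, progress-tracking, and error-recovery practices above can be sketched together. `process_in_chunks` and its per-page transform are hypothetical stand-ins, not the skill's API:

```python
from typing import Callable, Optional

def process_in_chunks(pages: list, chunk_size_pages: int = 50,
                      on_progress: Optional[Callable[[int, int], None]] = None):
    """Process pages in fixed-size chunks, reporting progress after each
    chunk. A failure in one chunk is recorded and processing continues."""
    results, errors = [], []
    total = len(pages)
    for start in range(0, total, chunk_size_pages):
        chunk = pages[start:start + chunk_size_pages]
        try:
            # Stand-in for real per-page work (parsing, extraction, ...)
            results.extend(page.upper() for page in chunk)
        except Exception as exc:
            errors.append((start, exc))
        if on_progress:
            on_progress(min(start + chunk_size_pages, total), total)
    return results, errors
```

Because only one chunk of pages is held in flight at a time, peak memory stays bounded regardless of total document size.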
Dependencies
- python-docx: DOCX file processing
- PyMuPDF: advanced PDF processing
- Pillow: image processing for embedded content
- pathlib: cross-platform path handling (Python standard library, no install needed)
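Assuming a standard pip environment, the third-party dependencies can be installed in one step (pathlib ships with Python 3.4+ and needs no install):

```shell
pip install python-docx PyMuPDF Pillow
```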