Claude-skills docx-advanced-patterns
Advanced python-docx patterns for nested tables, complex cells, and content extraction beyond .text property. Techniques for forms, checklists, and complex layouts.
git clone https://github.com/belumume/claude-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/belumume/claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/web-desktop-exports/docx-advanced-patterns" ~/.claude/skills/belumume-claude-skills-docx-advanced-patterns-f60cb4 && rm -rf "$T"
web-desktop-exports/docx-advanced-patterns/SKILL.md
DOCX Advanced Patterns Skill
Specialized patterns for python-docx that handle complex document structures not covered by basic .text extraction.
When to Use This Skill
Invoke this skill when working with DOCX files that have:
- Nested tables within table cells
- Forms with checkbox options
- Complex multi-row cell layouts
- Checklists with embedded options
- Cell content that doesn't appear in the .text property
Use alongside the official docx skill for comprehensive document handling.
Core Pattern: Nested Table Extraction
Problem
python-docx's cell.text property only extracts direct paragraph text; it does not traverse nested tables within cells.
Symptom:
    cell.text
    # Returns: '' or '\n'
    # But cell visually contains content!
Detection
Check if a cell contains nested tables:
    if cell.tables:
        print(f"Found {len(cell.tables)} nested table(s)")
        # Cell has nested content - need special extraction
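The detection check above only says whether a cell has nested tables, not how deep they go. As a rough sketch, a depth measurement can help decide between the simple and the recursive extraction. The helper name `max_nesting_depth` is hypothetical; it assumes only the `tables`, `rows`, and `cells` attributes that python-docx cells, tables, and rows expose.

```python
def max_nesting_depth(cell):
    """Return how many levels of tables are nested inside `cell` (0 if none)."""
    depth = 0
    for nested_table in cell.tables:
        # A nested table counts as one level even if it has no rows
        depth = max(depth, 1)
        for row in nested_table.rows:
            for inner_cell in row.cells:
                # Each inner cell may itself contain further nested tables
                depth = max(depth, 1 + max_nesting_depth(inner_cell))
    return depth
```

A depth of 1 suggests the simple extraction is enough; anything greater probably calls for the recursive variant.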
Solution (Simple)
    def extract_cell_content_with_nested_tables(cell):
        """
        Extract all text from a cell, including text from nested tables.

        Args:
            cell: python-docx _Cell object

        Returns:
            str: Combined text from cell paragraphs and nested tables
        """
        text_parts = []

        # Get direct paragraph text (not inside nested tables)
        for para in cell.paragraphs:
            para_text = para.text.strip()
            if para_text:
                text_parts.append(para_text)

        # Get content from nested tables
        if cell.tables:
            for nested_table in cell.tables:
                for nested_row in nested_table.rows:
                    # For checkbox lists: Column 0 = label, Column 1 = checkbox
                    # Extract text from first column only
                    if nested_row.cells:
                        first_col_text = nested_row.cells[0].text.strip()
                        # Filter out checkbox characters
                        if first_col_text and first_col_text not in ['', '☐', '☑', '☒']:
                            text_parts.append(first_col_text)

        return '\n'.join(text_parts) if text_parts else ''
Solution (Recursive for Deep Nesting)
For documents with multiple levels of table nesting:
    def extract_cell_content_recursively(cell):
        """
        Recursively extract text from cell including deeply nested tables.
        Handles arbitrary nesting depth.
        """
        text_parts = []

        def _extract_recursive(cell_obj):
            # Get direct paragraphs
            for para in cell_obj.paragraphs:
                para_text = para.text.strip()
                if para_text and para_text not in ['', '☐', '☑', '☒']:
                    text_parts.append(para_text)

            # Recursively get nested tables
            for nested_table in cell_obj.tables:
                for nested_row in nested_table.rows:
                    for nested_cell in nested_row.cells:
                        _extract_recursive(nested_cell)

        _extract_recursive(cell)
        return '\n'.join(text_parts) if text_parts else ''
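One caveat with visiting every cell in every row: python-docx typically reports a horizontally merged cell once per grid column it spans, so a recursive walk can emit the same text several times. The sketch below is one possible way to skip the repeats. The function name `extract_unique_cell_text` is hypothetical, and it relies on `_tc`, a private python-docx attribute holding the underlying cell element, so treat it as an assumption that may not hold across library versions.

```python
def extract_unique_cell_text(cell, _seen=None):
    """Recursively collect text, visiting each underlying cell element only once."""
    # Map id(element) -> element; holding the element keeps its id stable
    seen = {} if _seen is None else _seen
    parts = []

    for para in cell.paragraphs:
        text = para.text.strip()
        if text:
            parts.append(text)

    for nested_table in cell.tables:
        for row in nested_table.rows:
            for inner in row.cells:
                # Assumption: merged cells share one `_tc` element (private API);
                # fall back to the cell object itself if `_tc` is missing
                key = getattr(inner, "_tc", inner)
                if id(key) in seen:
                    continue
                seen[id(key)] = key
                nested_text = extract_unique_cell_text(inner, seen)
                if nested_text:
                    parts.append(nested_text)

    return "\n".join(parts)
```

Unlike the functions above, this sketch does not filter checkbox characters; that step can be applied to the result separately.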
Usage Examples
Example 1: Extracting Form Checkbox Options
Document Structure:
    Table Cell contains:
      Nested Table:
        Row 1: "High potential"     | ☐
        Row 2: "Moderate potential" | ☐
        Row 3: "Low potential"      | ☐
Extraction:
    from docx import Document

    doc = Document('form.docx')
    table = doc.tables[0]
    cell = table.rows[1].cells[0]

    # Wrong way - returns empty
    basic_text = cell.text
    print(basic_text)  # Output: '' or '\n'

    # Right way - extracts nested content
    full_text = extract_cell_content_with_nested_tables(cell)
    print(full_text)
    # Output:
    # High potential
    # Moderate potential
    # Low potential
Example 2: Processing All Cells in a Table
    def process_table_with_nested_content(table):
        """Process all cells, handling nested tables"""
        for row in table.rows:
            for cell in row.cells:
                # Extract with nested table support
                content = extract_cell_content_with_nested_tables(cell)
                if content:
                    # Process content (translate, analyze, etc.)
                    # do_something_with is a placeholder for your own processing function
                    processed = do_something_with(content)
                    print(f"Cell content: {processed}")
Example 3: Detecting Nested Tables
    from docx import Document

    def analyze_document_structure(doc):
        """Find all cells with nested tables"""
        nested_cells = []
        for t_idx, table in enumerate(doc.tables):
            for r_idx, row in enumerate(table.rows):
                for c_idx, cell in enumerate(row.cells):
                    if cell.tables:
                        nested_cells.append({
                            'table': t_idx,
                            'row': r_idx,
                            'col': c_idx,
                            'nested_count': len(cell.tables)
                        })
        return nested_cells

    # Usage
    doc = Document('complex_form.docx')
    nested = analyze_document_structure(doc)
    for item in nested:
        print(f"Table {item['table']}, Row {item['row']}, Col {item['col']}: "
              f"{item['nested_count']} nested table(s)")
Common Use Cases
1. Government Forms
Forms often use nested tables for checkbox grids:
    def extract_form_responses(doc):
        """Extract all form checkbox options"""
        responses = {}
        for table in doc.tables:
            for row in table.rows:
                cells = row.cells
                # Skip rows without a second (options) cell
                if len(cells) < 2:
                    continue
                # First cell = question
                question = cells[0].text.strip()
                # Second cell = checkbox options (nested table)
                if cells[1].tables:
                    options = extract_cell_content_with_nested_tables(cells[1])
                    responses[question] = options.split('\n')
        return responses
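The function above collects the option labels but not which boxes are ticked. If the form stores each box as a plain Unicode character in the second column of the nested table (the layout described earlier: column 0 = label, column 1 = checkbox), the state can be read alongside the label. This is only a sketch: the name `extract_checkbox_states` is hypothetical, and treating both ☑ and ☒ as "checked" is an assumption that may not match every form.

```python
CHECKED = {"☑", "☒"}   # assumption: both glyphs mean the option is selected
UNCHECKED = {"☐"}

def extract_checkbox_states(cell):
    """Return a list of (label, state) pairs from nested checkbox tables.

    state is True (checked), False (unchecked), or None when the second
    column holds no recognized checkbox character.
    """
    results = []
    for nested_table in cell.tables:
        for row in nested_table.rows:
            cells = row.cells
            if len(cells) < 2:
                continue  # no checkbox column in this row
            label = cells[0].text.strip()
            mark = cells[1].text.strip()
            if not label:
                continue
            if mark in CHECKED:
                state = True
            elif mark in UNCHECKED:
                state = False
            else:
                state = None
            results.append((label, state))
    return results
```

Returning None for unrecognized marks keeps unexpected content visible instead of silently treating it as unchecked.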
2. Evaluation Forms
Extract rating scales and options:
    def extract_evaluation_items(doc):
        """Extract evaluation criteria and options"""
        evaluations = []
        for table in doc.tables:
            # Skip the header row
            for row in table.rows[1:]:
                cells = row.cells
                # Skip rows without a rating cell
                if len(cells) < 2:
                    continue
                # Get criterion
                criterion = cells[0].text.strip()
                # Get rating options (often nested)
                rating_options = extract_cell_content_with_nested_tables(cells[1])
                evaluations.append({
                    'criterion': criterion,
                    'options': rating_options.split('\n')
                })
        return evaluations
3. Complex Data Tables
Extract structured data from cells with nested layouts:
    def extract_complex_cell_data(cell):
        """Extract data from cells with complex nested structures"""
        data = {
            'main_content': '',
            'nested_items': []
        }

        # Direct paragraphs (first non-empty paragraph only)
        for para in cell.paragraphs:
            if para.text.strip():
                data['main_content'] = para.text.strip()
                break

        # Nested table data
        if cell.tables:
            for nested_table in cell.tables:
                for nested_row in nested_table.rows:
                    row_data = [c.text.strip() for c in nested_row.cells]
                    data['nested_items'].append(row_data)

        return data
Integration with Official docx Skill
This skill complements the official docx skill:
Official docx skill provides:
- Document creation (docx-js)
- Basic text extraction (pandoc)
- Tracked changes workflows
- Comment handling
- XML access for complex cases
This skill provides:
- Nested table extraction
- Complex cell content handling
- Form and checklist processing
- Advanced content extraction patterns
Use together:
    # For basic operations: use official skill
    from docx import Document
    # For nested table handling: use this skill
    # (docx_advanced is assumed to be a local module holding the functions above)
    from docx_advanced import extract_cell_content_with_nested_tables

    # Combine both
    doc = Document('complex_form.docx')   # Official
    for table in doc.tables:              # Official
        for row in table.rows:            # Official
            for cell in row.cells:        # Official
                # Advanced extraction:
                content = extract_cell_content_with_nested_tables(cell)
Performance Considerations
For Large Documents:
Cache nested table checks:
    def build_nested_table_cache(doc):
        """Pre-compute which cells have nested tables"""
        cache = {}
        for t_idx, table in enumerate(doc.tables):
            for r_idx, row in enumerate(table.rows):
                for c_idx, cell in enumerate(row.cells):
                    if cell.tables:
                        cache[(t_idx, r_idx, c_idx)] = len(cell.tables)
        return cache

    # Usage
    cache = build_nested_table_cache(doc)
    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if (t_idx, r_idx, c_idx) in cache:
                    # This cell has nested tables
                    content = extract_cell_content_with_nested_tables(cell)
                else:
                    # Regular extraction
                    content = cell.text
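The cache pays off when the same document is queried repeatedly, but it walks every cell twice on the first use. When each cell is only needed once, a single-pass generator may be simpler. The sketch below is one way to do that; `iter_cell_contents` is a hypothetical name, and `extract_nested` stands for whichever nested-aware extraction function is in use.

```python
def iter_cell_contents(doc, extract_nested):
    """Yield ((table_idx, row_idx, col_idx), text) for every cell in one traversal.

    extract_nested: callable taking a cell and returning its full text,
    used only for cells that actually contain nested tables.
    """
    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    text = extract_nested(cell)   # nested-aware extraction
                else:
                    text = cell.text              # regular extraction
                yield (t_idx, r_idx, c_idx), text
```

Because it is a generator, results can be consumed lazily, for example `for position, text in iter_cell_contents(doc, extract_cell_content_with_nested_tables): ...`.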
Troubleshooting
Issue: Extraction returns empty despite visible content
Diagnosis:
    cell = table.rows[1].cells[0]
    print(f"cell.text: '{cell.text}'")
    print(f"cell.tables: {len(cell.tables)}")
    if not cell.text.strip() and cell.tables:
        print("Content is in nested tables!")
Fix: Use extract_cell_content_with_nested_tables(cell)
Issue: Checkbox characters (e.g. ☐) appear in output
Fix: Filter them out:
    text = cell.text.strip()
    # Remove checkbox unicode characters
    clean_text = text.replace('☐', '').replace('☑', '').replace('☒', '')
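Chained `replace` calls work, but they leave behind the whitespace and empty lines that surrounded each box. As a minimal sketch, the standard `str.translate` method can remove all checkbox characters in one pass, followed by a whitespace cleanup. The names `CHECKBOX_CHARS` and `strip_checkbox_chars` are illustrative, and the character set covers only the three glyphs named in this document.

```python
CHECKBOX_CHARS = "☐☑☒"
# Translation table that deletes every character in CHECKBOX_CHARS
_STRIP_TABLE = str.maketrans("", "", CHECKBOX_CHARS)

def strip_checkbox_chars(text):
    """Remove checkbox glyphs, then drop the blank lines and padding they leave."""
    cleaned = text.translate(_STRIP_TABLE)
    lines = (line.strip() for line in cleaned.splitlines())
    return "\n".join(line for line in lines if line)
```

If a form uses other symbols for its boxes, they can be appended to `CHECKBOX_CHARS`.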
Issue: Multi-line content not preserved
Fix: Join with newlines:
'\n'.join(text_parts) # Preserves line structure
Best Practices
- Always check for nested tables first:
    if cell.tables:
        content = extract_cell_content_with_nested_tables(cell)
    else:
        content = cell.text
- Handle checkbox characters:
    CHECKBOX_CHARS = ['', '☐', '☑', '☒']
    if text not in CHECKBOX_CHARS:
        ...  # Process text
- Preserve structure:
    # Use newlines to maintain line breaks
    '\n'.join(lines)
- Test with sample documents:
    from docx import Document

    def test_extraction():
        doc = Document('sample_form.docx')
        cell = doc.tables[0].rows[1].cells[0]
        extracted = extract_cell_content_with_nested_tables(cell)
        assert 'High potential' in extracted
        assert 'Moderate potential' in extracted
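Testing against a sample document requires a suitable .docx file on disk. For quick checks of the extraction logic itself, lightweight stand-in objects can mimic the few attributes the extractors read (`text`, `paragraphs`, `tables`, `rows`, `cells`). The sketch below assumes nothing beyond those attributes; `fake_cell`, `fake_table`, and `check_extractor` are hypothetical helpers, not part of python-docx, and they do not replace a test against a real document.

```python
from types import SimpleNamespace

def fake_cell(text="", tables=()):
    """Stand-in for a cell: exposes text, paragraphs, and tables."""
    paragraphs = [SimpleNamespace(text=line) for line in text.split("\n")] if text else []
    return SimpleNamespace(text=text, paragraphs=paragraphs, tables=list(tables))

def fake_table(rows):
    """Stand-in for a table, built from a list of rows of cell texts."""
    return SimpleNamespace(
        rows=[SimpleNamespace(cells=[fake_cell(t) for t in row]) for row in rows]
    )

def check_extractor(extract):
    """Assert that `extract` returns the labels of a nested checkbox table."""
    nested = fake_table([["High potential", "☐"], ["Moderate potential", "☐"]])
    outer = fake_cell(tables=[nested])  # outer cell has no direct text
    extracted = extract(outer)
    assert "High potential" in extracted
    assert "Moderate potential" in extracted
    assert "☐" not in extracted
    return True
```

Calling `check_extractor(extract_cell_content_with_nested_tables)` would then exercise the simple extraction function without touching the file system.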
Reference Implementation
See REFERENCE.md for:
- Complete working examples
- Integration patterns
- Advanced recursive extraction
- Performance optimization techniques
Contributing to Anthropic Skills
This pattern is not currently in the official docx skill. If you find it useful, consider contributing:
- Fork https://github.com/anthropics/skills
- Add to document-skills/docx/SKILL.md
- Submit pull request with:
  - Pattern description
  - Code examples
  - Use cases
Success Criteria
Pattern is working if:
- Cells with nested tables return full content
- Checkbox options are extracted correctly
- Form fields are readable
- No content is lost during extraction
- Structure is preserved (line breaks maintained)