Claude-skills docx-advanced-patterns

Advanced python-docx patterns for nested tables, complex cells, and content extraction beyond the .text property. Techniques for forms, checklists, and complex layouts.

install

source · Clone the upstream repo

git clone https://github.com/belumume/claude-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/belumume/claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/web-desktop-exports/docx-advanced-patterns" ~/.claude/skills/belumume-claude-skills-docx-advanced-patterns-f60cb4 && rm -rf "$T"

manifest: web-desktop-exports/docx-advanced-patterns/SKILL.md

source content

DOCX Advanced Patterns Skill

Specialized patterns for python-docx that handle complex document structures not covered by basic `.text` extraction.

When to Use This Skill

Invoke this skill when working with DOCX files that have:

Nested tables within table cells
Forms with checkbox options
Complex multi-row cell layouts
Checklists with embedded options
Cell content that doesn't appear in the `.text` property

Use alongside the official `docx` skill for comprehensive document handling.

Core Pattern: Nested Table Extraction

Problem

python-docx's `cell.text` property only extracts direct paragraph text; it does not traverse nested tables within cells.

Symptom:

cell.text  # Returns: '' or '\n'
# But cell visually contains content!

Detection

Check if a cell contains nested tables:

if cell.tables:
    print(f"Found {len(cell.tables)} nested table(s)")
    # Cell has nested content - need special extraction

Solution (Simple)

def extract_cell_content_with_nested_tables(cell):
    """
    Extract all text from a cell, including text from nested tables.

    Args:
        cell: python-docx _Cell object

    Returns:
        str: Combined text from cell paragraphs and nested tables
    """
    text_parts = []

    # Get direct paragraph text (not inside nested tables)
    for para in cell.paragraphs:
        para_text = para.text.strip()
        if para_text:
            text_parts.append(para_text)

    # Get content from nested tables
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                # For checkbox lists: Column 0 = label, Column 1 = checkbox
                # Extract text from first column only
                if nested_row.cells:
                    first_col_text = nested_row.cells[0].text.strip()
                    # Filter out checkbox characters
                    if first_col_text and first_col_text not in ['⁮', '☐', '☑', '☒']:
                        text_parts.append(first_col_text)

    return '\n'.join(text_parts) if text_parts else ''
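
The simple extractor keeps the labels and discards the checkbox column. When the selection itself matters, a variant can return each label together with its state. The following is a minimal sketch, not part of the original skill: the name `extract_checkbox_options` is hypothetical, it assumes the two-column layout described in the comment above (label in column 0, glyph in column 1), and it assumes that ☑ and ☒ both mean "selected". It only reads `.tables`, `.rows`, `.cells`, and `.text`, so it should accept a python-docx `_Cell`. Forms that store the state somewhere other than a visible glyph are out of scope for this sketch.

```python
CHECKED_GLYPHS = {'☑', '☒'}    # assumption: both glyphs mean "selected"
UNCHECKED_GLYPHS = {'☐'}


def extract_checkbox_options(cell):
    """Return (label, state) pairs from checkbox tables nested in a cell.

    state is True (checked), False (unchecked), or None when column 1
    is missing or holds no recognised glyph.
    """
    options = []
    for nested_table in cell.tables:
        for nested_row in nested_table.rows:
            cells = nested_row.cells
            if not cells:
                continue
            label = cells[0].text.strip()
            if not label:
                continue  # Skip rows without a label
            mark = cells[1].text.strip() if len(cells) > 1 else ''
            if any(glyph in mark for glyph in CHECKED_GLYPHS):
                state = True
            elif any(glyph in mark for glyph in UNCHECKED_GLYPHS):
                state = False
            else:
                state = None
            options.append((label, state))
    return options
```

Returning `None` for an unrecognised mark keeps "unchecked" distinct from "could not tell", which is useful when a form mixes glyph styles.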

Solution (Recursive for Deep Nesting)

For documents with multiple levels of table nesting:

def extract_cell_content_recursively(cell):
    """
    Recursively extract text from cell including deeply nested tables.

    Handles arbitrary nesting depth.
    """
    text_parts = []

    def _extract_recursive(cell_obj):
        # Get direct paragraphs
        for para in cell_obj.paragraphs:
            para_text = para.text.strip()
            if para_text and para_text not in ['⁮', '☐', '☑', '☒']:
                text_parts.append(para_text)

        # Recursively get nested tables
        for nested_table in cell_obj.tables:
            for nested_row in nested_table.rows:
                for nested_cell in nested_row.cells:
                    _extract_recursive(nested_cell)

    _extract_recursive(cell)
    return '\n'.join(text_parts) if text_parts else ''
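
One caveat when iterating `row.cells`: python-docx typically returns a merged cell once for every grid position it covers, so the recursive walk can emit the same text several times. The variant below is a sketch under that assumption. It tracks the underlying XML element through `_tc`, a private python-docx attribute that may change between versions, and skips cells it has already visited. The function name `extract_cell_content_unique` is hypothetical.

```python
CHECKBOX_CHARS = {'⁮', '☐', '☑', '☒'}


def extract_cell_content_unique(cell):
    """Recursively extract text, visiting each underlying cell only once."""
    text_parts = []
    # Maps id(element) -> element. Keeping the element referenced stops
    # Python from reusing its id for a different object during the walk.
    seen = {}

    def _walk(cell_obj):
        # python-docx cells expose their XML element as _tc (private API);
        # fall back to the object itself for anything else.
        element = getattr(cell_obj, '_tc', cell_obj)
        if id(element) in seen:
            return
        seen[id(element)] = element

        for para in cell_obj.paragraphs:
            para_text = para.text.strip()
            if para_text and para_text not in CHECKBOX_CHARS:
                text_parts.append(para_text)

        for nested_table in cell_obj.tables:
            for nested_row in nested_table.rows:
                for nested_cell in nested_row.cells:
                    _walk(nested_cell)

    _walk(cell)
    return '\n'.join(text_parts)
```

If the form legitimately repeats the same label in separate cells, those repeats are kept, because deduplication is by cell identity rather than by text.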

Usage Examples

Example 1: Extracting Form Checkbox Options

Document Structure:

Table Cell contains:
  Nested Table:
    Row 1: "High potential" | ☐
    Row 2: "Moderate potential" | ☐
    Row 3: "Low potential" | ☐

Extraction:

from docx import Document

doc = Document('form.docx')
table = doc.tables[0]
cell = table.rows[1].cells[0]

# Wrong way - returns empty
basic_text = cell.text
print(basic_text)  # Output: '' or '\n'

# Right way - extracts nested content
full_text = extract_cell_content_with_nested_tables(cell)
print(full_text)
# Output:
# High potential
# Moderate potential
# Low potential

Example 2: Processing All Cells in a Table

def process_table_with_nested_content(table, process=str.strip):
    """Process all cells, handling nested tables.

    `process` is any callable applied to each cell's text (translate,
    analyze, etc.); the default only strips whitespace.
    """
    for row in table.rows:
        for cell in row.cells:
            # Extract with nested table support
            content = extract_cell_content_with_nested_tables(cell)

            if content:
                processed = process(content)
                print(f"Cell content: {processed}")

Example 3: Detecting Nested Tables

def analyze_document_structure(doc):
    """Find all cells with nested tables"""
    nested_cells = []

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    nested_cells.append({
                        'table': t_idx,
                        'row': r_idx,
                        'col': c_idx,
                        'nested_count': len(cell.tables)
                    })

    return nested_cells

# Usage
doc = Document('complex_form.docx')
nested = analyze_document_structure(doc)

for item in nested:
    print(f"Table {item['table']}, Row {item['row']}, Col {item['col']}: "
          f"{item['nested_count']} nested table(s)")
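
`doc.tables` lists only top-level tables, and the analysis above looks one level down. To decide between the simple and the recursive extractor it helps to know how deep the nesting actually goes. The sketch below walks every level; `iter_tables_with_depth` and `max_nesting_depth` are hypothetical names, and the code assumes only that documents and cells expose `.tables`, tables expose `.rows`, and rows expose `.cells`.

```python
def iter_tables_with_depth(container, depth=0):
    """Yield (depth, table) for every table under a document or cell.

    `container` is anything with a `.tables` attribute, such as a
    python-docx Document (depth 0 = top level) or a table cell.
    """
    for table in container.tables:
        yield depth, table
        for row in table.rows:
            for cell in row.cells:
                # Tables inside this cell sit one level deeper
                yield from iter_tables_with_depth(cell, depth + 1)


def max_nesting_depth(doc):
    """Return the deepest nesting level found, or -1 if there are no tables."""
    return max((depth for depth, _ in iter_tables_with_depth(doc)), default=-1)
```

A result of 0 means no table is nested, so plain `cell.text` is enough; 1 is covered by the simple solution; 2 or more calls for the recursive one. Merged cells may cause the same table to be yielded more than once, which does not affect the depth.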

Common Use Cases

1. Government Forms

Forms often use nested tables for checkbox grids:

def extract_form_responses(doc):
    """Extract all form checkbox options"""
    responses = {}

    for table in doc.tables:
        for row in table.rows:
            cells = row.cells
            if len(cells) < 2:
                continue  # Row has no options column

            # First cell = question
            question = cells[0].text.strip()

            # Second cell = checkbox options (nested table)
            if cells[1].tables:
                options = extract_cell_content_with_nested_tables(cells[1])
                # ''.split('\n') gives [''], so use [] when nothing was found
                responses[question] = options.split('\n') if options else []

    return responses

2. Evaluation Forms

Extract rating scales and options:

def extract_evaluation_items(doc):
    """Extract evaluation criteria and options"""
    evaluations = []

    for table in doc.tables:
        # Skip the header row
        for row in table.rows[1:]:
            cells = row.cells
            if len(cells) < 2:
                continue  # No rating column in this row

            # Get criterion
            criterion = cells[0].text.strip()

            # Get rating options (often nested)
            rating_options = extract_cell_content_with_nested_tables(cells[1])

            evaluations.append({
                'criterion': criterion,
                # ''.split('\n') gives [''], so use [] for empty cells
                'options': rating_options.split('\n') if rating_options else []
            })

    return evaluations

3. Complex Data Tables

Extract structured data from cells with nested layouts:

def extract_complex_cell_data(cell):
    """Extract data from cells with complex nested structures"""
    data = {
        'main_content': '',
        'nested_items': []
    }

    # Direct paragraphs
    for para in cell.paragraphs:
        if para.text.strip():
            data['main_content'] = para.text.strip()
            break

    # Nested table data
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                row_data = [c.text.strip() for c in nested_row.cells]
                data['nested_items'].append(row_data)

    return data

Integration with Official docx Skill

This skill complements the official docx skill:

Official docx skill provides:

Document creation (docx-js)
Basic text extraction (pandoc)
Tracked changes workflows
Comment handling
XML access for complex cases

This skill provides:

Nested table extraction
Complex cell content handling
Form and checklist processing
Advanced content extraction patterns

Use together:

# python-docx opens the document and walks the table structure
from docx import Document

# For nested table handling: use this skill's helper.
# docx_advanced is a placeholder name for whichever module you saved
# the function in.
from docx_advanced import extract_cell_content_with_nested_tables

# Combine both
doc = Document('complex_form.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            # Advanced extraction:
            content = extract_cell_content_with_nested_tables(cell)

Performance Considerations

For Large Documents:

Cache nested table checks:

def build_nested_table_cache(doc):
    """Pre-compute which cells have nested tables"""
    cache = {}

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    cache[(t_idx, r_idx, c_idx)] = len(cell.tables)

    return cache

# Usage
cache = build_nested_table_cache(doc)

for t_idx, table in enumerate(doc.tables):
    for r_idx, row in enumerate(table.rows):
        for c_idx, cell in enumerate(row.cells):
            if (t_idx, r_idx, c_idx) in cache:
                # This cell has nested tables
                content = extract_cell_content_with_nested_tables(cell)
            else:
                # Regular extraction
                content = cell.text

Troubleshooting

Issue: Extraction returns empty despite visible content

Diagnosis:

cell = table.rows[1].cells[0]
print(f"cell.text: '{cell.text}'")
print(f"cell.tables: {len(cell.tables)}")

if not cell.text.strip() and cell.tables:
    print("Content is in nested tables!")

Fix: Use `extract_cell_content_with_nested_tables(cell)`

Issue: Checkbox characters (⁮, ☐) appear in output

Fix: Filter them out:

text = cell.text.strip()
# Remove checkbox unicode characters
clean_text = text.replace('⁮', '').replace('☐', '').replace('☑', '').replace('☒', '')
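
Chained `.replace()` calls leave stray spaces and empty lines behind where a glyph was removed. A translation table removes every glyph in one pass, and a whitespace cleanup tidies each line afterwards. This is a minimal sketch; `strip_checkbox_chars` is a hypothetical helper name, and the glyph list is the one used throughout this skill.

```python
CHECKBOX_CHARS = '⁮☐☑☒'

# str.maketrans with a third argument maps each listed character to None,
# which str.translate treats as "delete".
_CHECKBOX_TABLE = str.maketrans('', '', CHECKBOX_CHARS)


def strip_checkbox_chars(text):
    """Remove checkbox glyphs, tidy spacing, and drop lines left empty."""
    cleaned_lines = []
    for line in text.translate(_CHECKBOX_TABLE).splitlines():
        line = ' '.join(line.split())  # Collapse runs of whitespace
        if line:
            cleaned_lines.append(line)
    return '\n'.join(cleaned_lines)
```

Unlike the `not in [...]` checks in the extractors, which only skip text that consists of a single glyph, this also cleans labels that have a glyph attached, such as `'High potential ☐'`.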

Issue: Multi-line content not preserved

Fix: Join with newlines:

'\n'.join(text_parts)  # Preserves line structure

Best Practices

Always check for nested tables first:

if cell.tables:
    content = extract_cell_content_with_nested_tables(cell)
else:
    content = cell.text

Handle checkbox characters:

CHECKBOX_CHARS = ['⁮', '☐', '☑', '☒']
text = cell.text.strip()
if text and text not in CHECKBOX_CHARS:
    print(text)  # Replace with your own processing

Preserve structure:

# Use newlines to maintain line breaks
'\n'.join(lines)

Test with sample documents:

def test_extraction():
    doc = Document('sample_form.docx')
    cell = doc.tables[0].rows[1].cells[0]

    extracted = extract_cell_content_with_nested_tables(cell)
    assert 'High potential' in extracted
    assert 'Moderate potential' in extracted
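
The test above depends on a `sample_form.docx` file being available. Because the extractor only reads `.paragraphs`, `.tables`, `.rows`, `.cells`, and `.text`, its logic can also be checked against lightweight stand-ins. The sketch below is an illustration under stated assumptions: `make_cell` and `make_table` are hypothetical helpers built on `types.SimpleNamespace`, and the extractor is a condensed copy of the simple solution so that the block runs on its own. It checks the extraction logic only and does not replace a test against a real document.

```python
from types import SimpleNamespace

CHECKBOX_CHARS = ['⁮', '☐', '☑', '☒']


def extract_cell_content_with_nested_tables(cell):
    """Condensed copy of the simple solution, repeated so this runs alone."""
    text_parts = [p.text.strip() for p in cell.paragraphs if p.text.strip()]
    for nested_table in cell.tables:
        for nested_row in nested_table.rows:
            if nested_row.cells:
                first = nested_row.cells[0].text.strip()
                if first and first not in CHECKBOX_CHARS:
                    text_parts.append(first)
    return '\n'.join(text_parts)


def make_cell(text='', tables=()):
    """Hypothetical stand-in exposing the attributes the extractor reads."""
    paragraphs = [SimpleNamespace(text=line) for line in text.split('\n')]
    return SimpleNamespace(text=text, paragraphs=paragraphs, tables=list(tables))


def make_table(rows):
    """Build a stub table from a list of rows, each a list of cell texts."""
    return SimpleNamespace(
        rows=[SimpleNamespace(cells=[make_cell(t) for t in row]) for row in rows]
    )


def test_extraction_with_stubs():
    nested = make_table([
        ['High potential', '☐'],
        ['Moderate potential', '☐'],
        ['Low potential', '☐'],
    ])
    cell = make_cell(text='', tables=[nested])

    assert cell.text == ''  # What plain .text would report
    extracted = extract_cell_content_with_nested_tables(cell)
    assert extracted.split('\n') == [
        'High potential', 'Moderate potential', 'Low potential',
    ]
```

A test runner such as pytest should pick up `test_extraction_with_stubs` by name; it can also be called directly.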

Reference Implementation

See `REFERENCE.md` for:

Complete working examples
Integration patterns
Advanced recursive extraction
Performance optimization techniques

Contributing to Anthropic Skills

This pattern is not currently in the official `docx` skill. If you find it useful, consider contributing:

Fork https://github.com/anthropics/skills
Add to `document-skills/docx/SKILL.md`
Submit a pull request with:
- Pattern description
- Code examples
- Use cases

Success Criteria

Pattern is working if:

Cells with nested tables return full content
Checkbox options are extracted correctly
Form fields are readable
No content is lost during extraction
Structure is preserved (line breaks maintained)