OpenSpace excel-debug-extraction

Iteratively debug Excel structure with exploratory scripts before writing extraction logic

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/excel-debug-extraction" ~/.claude/skills/hkuds-openspace-excel-debug-extraction && rm -rf "$T"

manifest: gdpval_bench/skills/excel-debug-extraction/SKILL.md

source content

Excel Debug-First Extraction Workflow

When working with poorly-structured, complex, or unfamiliar Excel files, use this iterative debugging approach to map the data layout before writing your final extraction logic.

When to Use This Skill

Excel files with inconsistent formatting or merged cells
Files received from external sources with unknown structure
Complex workbooks with multiple sheets and interdependencies
When initial parsing attempts fail or produce unexpected results

Workflow Steps

Step 1: Initial Structure Reconnaissance

Before writing extraction logic, create a debug script to explore the file structure:

# debug_structure.py
from openpyxl import load_workbook

wb = load_workbook('file.xlsx')
print(f"Sheets: {wb.sheetnames}")

for sheet_name in wb.sheetnames:
    ws = wb[sheet_name]
    print(f"\n=== Sheet: {sheet_name} ===")
    print(f"Dimensions: {ws.dimensions}")
    
    # Print first 10 rows to understand header structure
    for row in ws.iter_rows(min_row=1, max_row=10, values_only=True):
        print([str(cell)[:50] for cell in row])

Step 2: Map Column Positions

Identify where key data fields are located:

# debug_columns.py
from openpyxl import load_workbook

wb = load_workbook('file.xlsx')
ws = wb['Sheet1']

# Examine header row to find column indices
header_row = 1
column_map = {}

for col in ws.iter_cols(min_row=header_row, max_row=header_row):
    for cell in col:
        if cell.value:
            column_map[str(cell.value)] = cell.column_letter

print("Column mapping:", column_map)

# Sample data rows to verify structure
for row_num in range(2, min(6, ws.max_row + 1)):
    row_data = [ws.cell(row=row_num, column=col).value 
                for col in range(1, ws.max_column + 1)]
    print(f"Row {row_num}: {row_data}")

Step 3: Identify Row Patterns

Understand how data rows are structured (e.g., summary rows, detail rows, blank separators):

# debug_rows.py
from openpyxl import load_workbook

wb = load_workbook('file.xlsx')
ws = wb['Sheet1']

row_types = []
for row_num in range(1, min(30, ws.max_row + 1)):
    row_values = [ws.cell(row=row_num, column=col).value 
                  for col in range(1, ws.max_column + 1)]
    
    non_empty = sum(1 for v in row_values if v is not None and str(v).strip())
    
    # Classify row type
    if non_empty == 0:
        row_type = "blank"
    elif non_empty == 1:
        row_type = "summary/label"
    elif non_empty == ws.max_column:
        row_type = "full_data"
    else:
        row_type = "partial"
    
    row_types.append((row_num, row_type, row_values[:5]))

for rt in row_types:
    print(f"Row {rt[0]} ({rt[1]}): {rt[2]}")

Step 4: Document Findings

Before writing extraction logic, summarize:

Sheet names and their purposes
Header row location and column mappings
Data row patterns (which rows contain actual data vs. headers/summaries)
Any special formatting (merged cells, blank separators, grouping rows)

Step 5: Write Extraction Logic

Incorporate findings into your final processing script:

# extract_data.py
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook('file.xlsx')
ws = wb['Sheet1']

# Use column mappings from debug phase
STORE_COL = 'B'  # Column 2
WEEK1_COL = 'D'  # Column 4
WEEK2_COL = 'E'  # Column 5

# Skip header rows and summary rows based on debug findings
data_rows = []
for row_num in range(5, ws.max_row + 1):  # Start after header based on debug
    # Skip summary/blank rows
    if ws.cell(row=row_num, column=2).value is None:
        continue
    if 'TOTAL' in str(ws.cell(row=row_num, column=2).value).upper():
        continue
    
    row_data = {
        'store': ws.cell(row=row_num, column=2).value,
        'week1': ws.cell(row=row_num, column=4).value,
        'week2': ws.cell(row=row_num, column=5).value,
    }
    data_rows.append(row_data)

df = pd.DataFrame(data_rows)
print(df.head())

Best Practices

Always start with exploration - Never assume Excel structure matches expectations
Save debug scripts - Keep them in your project for future reference and debugging
Print generously - Use verbose output during exploration to catch edge cases
Verify row-by-row - Don't assume all data rows follow the same pattern
Handle merged cells - Check for merged cells that span multiple rows/columns

Common Pitfalls to Avoid

Assuming header is always row 1
Assuming all rows between first and last contain data
Not checking for hidden sheets or protected ranges
Ignoring cell formatting that indicates row type (bold, indentation)
Not handling None values or empty strings consistently

File Naming Convention

Use descriptive names for debug scripts:

```
debug_structure.py
```
- Overall file/sheet structure
```
debug_columns.py
```
- Column positions and headers
```
debug_rows.py
```
- Row patterns and data boundaries
```
debug_values.py
```
- Value patterns and edge cases

Keep debug scripts alongside your extraction script for maintainability.