Claude-skill-registry excel-parser

Smart Excel/CSV file parsing with intelligent routing based on file complexity analysis. Analyzes file structure (merged cells, row count, table layout) using lightweight metadata scanning, then recommends optimal processing strategy - either high-speed Pandas mode for standard tables or semantic HTML mode for complex reports. Use when processing Excel/CSV files with unknown or varying structure where optimization between speed and accuracy is needed.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/excel-parser" ~/.claude/skills/majiayu000-claude-skill-registry-excel-parser && rm -rf "$T"

manifest: skills/data/excel-parser/SKILL.md

source content

Excel Parser

Overview

Provide intelligent routing strategies for parsing Excel/CSV files by analyzing complexity and choosing the optimal processing path. The skill implements a "Scout Pattern" that scans file metadata before processing to balance speed (Pandas) with accuracy (semantic extraction).

Core Philosophy: Scout Pattern

Before processing data, deploy a lightweight "scout" to analyze file metadata and make intelligent routing decisions:

Metadata Scanning - Use
```
openpyxl
```
to scan file structure without loading data
Complexity Scoring - Calculate score based on merged cells, row count, and layout
Path Selection - Choose between Pandas (fast) or HTML (accurate) processing
Optimized Execution - Execute with the most appropriate tool for the file type

Key Principle: "LLM handles metadata decisions, Pandas/HTML processes bulk data"

When to Use This Skill

Use excel-parser when:

Processing Excel/CSV files with unknown structure or varying complexity
Handling files ranging from simple data tables to complex financial reports
Need to optimize between processing speed and extraction accuracy
Working with files that may contain merged cells, multi-level headers, or irregular layouts

Skip this skill when:

File structure is already known and documented
Processing simple, well-structured tables with confirmed format
Using predefined scripts for specific file formats

Processing Workflow

Step 1: Analyze File Complexity

Use the

scripts/complexity_analyzer.py

to scan file metadata:

python scripts/complexity_analyzer.py <file_path> [sheet_name]

What it analyzes (without loading data):

Merged cell distribution (shallow vs deep in the table)
Row count and data continuity
Empty row interruptions (indicates multi-table layouts)

Output (JSON format):

{
  "is_complex": false,
  "recommended_strategy": "pandas",
  "reasons": ["No deep merges detected", "行数过多 (>1000), 强制使用 Pandas 模式"],
  "stats": {
    "total_rows": 5000,
    "deep_merges": 0,
    "empty_interruptions": 0
  }
}

Step 2: Route to Optimal Strategy

Based on complexity analysis:

is_complex = false → Use Path A (Pandas Standard Mode)
is_complex = true → Use Path B (HTML Semantic Mode)

Step 3: Execute Processing

Follow the selected path's workflow to extract data.

Complexity Scoring Rules

Rule 1: Deep Merged Cells

Condition: Merged cells appearing beyond row 5
Interpretation: Complex table structure (not just header formatting)
Decision: Mark as complex if >2 deep merges detected
Example: Financial reports with merged category labels in data region

Rule 2: Empty Row Interruptions

Condition: Multiple empty rows within the table
Interpretation: Multiple sub-tables in single sheet
Decision: Mark as complex if >2 empty row interruptions found
Example: Summary table + detail table in one sheet

Rule 3: Row Count Override

Condition: Total rows >1000
Interpretation: Too large for HTML processing (token explosion)
Decision: Force Pandas mode regardless of complexity
Rationale: HTML conversion would exceed token limits

Rule 4: Default (Standard Table)

Condition: No deep merges, continuous data, moderate size
Interpretation: Standard data table
Decision: Use Pandas for optimal speed

Path A: Pandas Standard Mode

When: Simple/large tables (most common case)

Strategy: Let LLM analyze ONLY the first 20 rows to determine header position, then use Pandas to read full data at native speed.

Workflow:

Sample First 20 Rows

import pandas as pd

df_sample = pd.read_excel(file_path, sheet_name=sheet_name, header=None, nrows=20)
csv_sample = df_sample.to_csv(index=False)

LLM Analyzes Header Position

Prompt template:

You are a Pandas expert. Analyze the following first 20 rows of an Excel file (in CSV format):

{csv_sample}

Task: Identify the true header row index (0-based). If row 0 is the header, return 0.
If the first two rows are titles and row 2 is the header, return 2.

Return JSON format:
{
    "header_row": <int>,
    "explanation": "<reasoning>"
}

Parse LLM Response

import json, re

json_match = re.search(r'\{.*\}', llm_response, re.DOTALL)
config = json.loads(json_match.group())
header_idx = config.get("header_row", 0)

Read Full Data with Correct Parameters

full_df = pd.read_excel(file_path, sheet_name=sheet_name, header=header_idx)
print(f"Successfully loaded DataFrame: {full_df.shape}")

Token Cost: ~500 tokens (only 20 rows analyzed by LLM) Processing Speed: Very fast (Pandas native speed)

Path B: HTML Semantic Mode

When: Complex/irregular tables (merged cells, multi-level headers)

Strategy: Convert to semantic HTML preserving structure (rowspan/colspan), then let LLM extract data understanding the visual layout.

Workflow:

Convert to Semantic HTML

import openpyxl
from openpyxl.utils import get_column_letter

wb = openpyxl.load_workbook(file_path, data_only=True)
sheet = wb[sheet_name]

html_parts = ['<table border="1">']

# Track merged cell spans
merge_map = {}
for merge in sheet.merged_cells.ranges:
    min_col, min_row, max_col, max_row = merge.bounds
    merge_map[(min_row, min_col)] = {
        'rowspan': max_row - min_row + 1,
        'colspan': max_col - min_col + 1
    }

# Build HTML with rowspan/colspan
for row_idx, row in enumerate(sheet.iter_rows(), start=1):
    html_parts.append('<tr>')
    for col_idx, cell in enumerate(row, start=1):
        # Skip cells that are part of a merge (not the top-left)
        if any((row_idx, col_idx) in range(...) for merge in sheet.merged_cells.ranges):
            if (row_idx, col_idx) not in merge_map:
                continue

        # Get cell value
        value = cell.value or ''

        # Add rowspan/colspan if this is a merged cell origin
        attrs = ''
        if (row_idx, col_idx) in merge_map:
            span = merge_map[(row_idx, col_idx)]
            if span['rowspan'] > 1:
                attrs += f' rowspan="{span["rowspan"]}"'
            if span['colspan'] > 1:
                attrs += f' colspan="{span["colspan"]}"'

        html_parts.append(f'<td{attrs}>{value}</td>')
    html_parts.append('</tr>')

html_parts.append('</table>')
html_content = '\n'.join(html_parts)

LLM Extracts Structured Data

Prompt template:

Analyze the following HTML table and extract key data as JSON.
Pay attention to merged cells (rowspan/colspan) which indicate hierarchical headers or grouped data.

HTML (first 2000 chars):
{html_content[:2000]}

Task: Extract the data preserving the semantic structure. Return JSON format suitable for the data type.

Parse and Return

import json, re

json_match = re.search(r'\{.*\}', llm_response, re.DOTALL)
extracted_data = json.loads(json_match.group())

Token Cost: Higher (full HTML structure analyzed) Processing Speed: Slower (LLM semantic extraction) Use Case: Only for small, complex files where Pandas would fail

Implementation Code Template

Complete executable example combining both paths:

import openpyxl
from openpyxl.utils import range_boundaries
import pandas as pd
import json
import sys

class SmartExcelRouter:
    def __init__(self, file_path):
        self.file_path = file_path
        self.wb = openpyxl.load_workbook(file_path, read_only=False, data_only=True)

    def analyze_sheet_complexity(self, sheet_name):
        """Scout function: Calculate complexity score without loading data."""
        sheet = self.wb[sheet_name]

        max_row = sheet.max_row
        merged_ranges = sheet.merged_cells.ranges

        # Analyze merge distribution
        deep_merges = 0
        for merge in merged_ranges:
            min_col, min_row, max_col, max_row_merge = range_boundaries(str(merge))
            if min_row > 5:  # Beyond header region
                deep_merges += 1

        # Check empty row interruptions
        empty_interruptions = 0
        if max_row < 200:  # Only check short tables
            for row in sheet.iter_rows(min_row=1, max_row=max_row):
                if all(cell.value is None for cell in row):
                    empty_interruptions += 1

        # Apply scoring rules
        is_complex = False
        reasons = []

        if deep_merges > 2:
            is_complex = True
            reasons.append(f"检测到数据区域有 {deep_merges} 处合并单元格")

        if max_row < 300 and empty_interruptions > 2:
            is_complex = True
            reasons.append("检测到多处空行,疑为多子表布局")

        if max_row > 1000:
            is_complex = False
            reasons = ["行数过多 (>1000), 强制使用 Pandas 模式"]

        if not is_complex and not reasons:
            reasons.append("结构规则,未检测到复杂布局")

        return {
            "is_complex": is_complex,
            "recommended_strategy": "html" if is_complex else "pandas",
            "reasons": reasons,
            "stats": {
                "total_rows": max_row,
                "deep_merges": deep_merges,
                "empty_interruptions": empty_interruptions
            }
        }

    def process_pandas_mode(self, sheet_name):
        """Path A: Read first 20 rows, LLM determines header, Pandas loads full data."""
        print(f"[Pandas Mode] Processing {sheet_name}...")

        # Step 1: Sample
        df_sample = pd.read_excel(self.file_path, sheet_name=sheet_name, header=None, nrows=20)
        csv_sample = df_sample.to_csv(index=False)

        # Step 2: LLM analyzes (you need to implement call_llm function)
        # For now, assume header is at row 0
        header_idx = 0  # Replace with LLM analysis

        # Step 3: Full read
        full_df = pd.read_excel(self.file_path, sheet_name=sheet_name, header=header_idx)
        print(f"   Loaded DataFrame: {full_df.shape}")

        return full_df

    def process_html_mode(self, sheet_name):
        """Path B: Convert to HTML, LLM extracts semantically."""
        print(f"[HTML Mode] Processing {sheet_name}...")

        # Implementation would include HTML conversion as shown above
        # For brevity, returning placeholder
        return {"message": "Use HTML conversion code from Path B section"}

# Usage
if __name__ == "__main__":
    router = SmartExcelRouter("example.xlsx")

    for sheet_name in router.wb.sheetnames:
        analysis = router.analyze_sheet_complexity(sheet_name)
        print(f"\nSheet: {sheet_name}")
        print(f"Strategy: {analysis['recommended_strategy']}")
        print(f"Reasons: {analysis['reasons']}")

        if analysis['is_complex']:
            router.process_html_mode(sheet_name)
        else:
            router.process_pandas_mode(sheet_name)

Best Practices

1. Trust the Scout

Always run complexity analysis before processing. The metadata scan is fast (<1 second) and prevents wasted effort on wrong approach.

2. Respect the Row Count Rule

Never attempt HTML mode on files >1000 rows. Token limits will cause failures.

3. Pandas First for Unknown Files

When in doubt, try Pandas mode first. It fails fast and clearly when structure is incompatible.

4. Cache Analysis Results

If processing multiple sheets from same file, run analysis once and cache results.

5. Preserve Original Files

Never modify the original Excel file during analysis or processing.

Dependencies

Required Python packages:

```
openpyxl
```
- Metadata scanning and Excel file manipulation
```
pandas
```
- High-speed data reading and manipulation
Access to Bash tool for executing Python scripts

Resources

This skill includes:

```
scripts/complexity_analyzer.py
```
- Standalone executable for complexity analysis
Implementation code templates in this document for both processing paths