Claude-skill-registry data-structure-checker
This skill should be used when reading any tabular data file (Excel, CSV, Parquet, ODS). It automatically detects and fixes common data issues including multi-level headers, encoding problems, empty rows/columns, and data type mismatches. Returns a clean DataFrame ready for analysis with zero user intervention.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-structure-checker" ~/.claude/skills/majiayu000-claude-skill-registry-data-structure-checker && rm -rf "$T"
skills/data/data-structure-checker/SKILL.md
Data Structure Checker
Overview
A comprehensive skill for automatically detecting and fixing common data file issues. Input any messy file and receive a clean DataFrame ready for analysis - zero intervention required.
Auto-fixes:
- Multi-level/hierarchical headers (flattens with separator)
- Encoding issues (utf-8, cp949, euc-kr, etc.)
- Empty rows and columns
- Data type inference and conversion
- Duplicate column names
- Unicode path issues (Korean/CJK filenames)
When to Use
This skill should be triggered when:
- Reading any Excel files (.xlsx, .xls)
- Reading any CSV/TSV files
- Reading ODS files (OpenDocument)
- Reading Parquet files
- Encountering "Unnamed:" columns in data
- Dealing with multi-level or hierarchical headers
- Processing Korean/Asian language data files
Usage
Reading Data Files
To read any tabular data file with automatic issue detection and fixing, import smart_read from scripts/checker.py:
    import sys
    sys.path.insert(0, 'skills/data-structure-checker/scripts')
    from checker import smart_read

    # Read file - handles all issues automatically
    df = smart_read('data.xlsx')

    # Read with report of fixes applied
    df, report = smart_read('data.xlsx', return_report=True)
Diagnosing Files
To analyze a file's structure without reading full data:
    from checker import diagnose

    result = diagnose('data.xlsx')
    # Returns: {'issues': ['multi_level_headers'], 'recommendations': [...]}
Command Line
    # Read and display summary
    uv run python skills/data-structure-checker/scripts/checker.py data.xlsx

    # Diagnose without reading
    uv run python skills/data-structure-checker/scripts/checker.py data.xlsx --diagnose
API Reference
smart_read(file_path, separator='_', return_report=False, sheet_name=0)
Main entry point for reading files with automatic issue resolution.
Parameters:
- file_path: Path to the file (handles Korean/Unicode filenames)
- separator: Character(s) for joining multi-level headers (default: '_')
- return_report: If True, return tuple (DataFrame, report)
- sheet_name: Sheet name or index for Excel files
Returns:
- DataFrame: Clean data ready for analysis
- Or (DataFrame, report) if return_report=True
diagnose(file_path, sheet_name=0)
Analyze file structure without reading full data.
Returns: Dictionary with detected issues and recommendations.
Report Structure
When return_report=True, the report contains:
    {
        'file_path': 'data/file.xlsx',
        'timestamp': '2024-12-17T10:30:00',
        'issues_detected': ['multi_level_headers', 'empty_rows_or_columns'],
        'fixes_applied': [
            'Flattened 3-level headers with "_" separator',
            'Removed 2 empty rows and 0 empty columns'
        ],
        'original_shape': (52, 111),
        'final_shape': (49, 111),
        'header_rows': [0, 1, 2],
        'type_conversions': {'score': 'object -> float64'}
    }
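As a rough illustration of consuming such a report, the sketch below turns it into a short text summary. The helper name summarize_report is hypothetical and not part of the skill; the dictionary keys are the ones shown in the report example.

```python
def summarize_report(report):
    """Build a short, human-readable summary of a fix report (hypothetical helper)."""
    rows_before, cols_before = report["original_shape"]
    rows_after, cols_after = report["final_shape"]
    lines = [
        f"{report['file_path']}: {rows_before}x{cols_before} -> {rows_after}x{cols_after}"
    ]
    # One line per applied fix, then one line per converted column.
    lines += [f"  fix: {fix}" for fix in report.get("fixes_applied", [])]
    lines += [
        f"  type: {column} ({change})"
        for column, change in report.get("type_conversions", {}).items()
    ]
    return "\n".join(lines)


# Sample report copied from the structure documented in this section.
report = {
    "file_path": "data/file.xlsx",
    "timestamp": "2024-12-17T10:30:00",
    "issues_detected": ["multi_level_headers", "empty_rows_or_columns"],
    "fixes_applied": [
        'Flattened 3-level headers with "_" separator',
        "Removed 2 empty rows and 0 empty columns",
    ],
    "original_shape": (52, 111),
    "final_shape": (49, 111),
    "header_rows": [0, 1, 2],
    "type_conversions": {"score": "object -> float64"},
}

summary = summarize_report(report)
print(summary)
```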
Issues Handled
Multi-Level Headers
Detects and flattens hierarchical headers:
Before:
응시자 정보 | Unnamed: 1 | Unnamed: 2
After:
응시자 정보_응시코드 | 응시자 정보_성명 | 응시자 정보_부서
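The flattening step can be sketched in plain Python as follows. This is a minimal illustration, not the skill's actual implementation: the function name flatten_headers is hypothetical, and it assumes that blank or "Unnamed:" cells in upper header rows stand for a merged parent cell that extends to the right.

```python
def _is_blank(cell):
    """Treat None, empty strings, and pandas-style 'Unnamed: N' labels as blank."""
    text = "" if cell is None else str(cell).strip()
    return text == "" or text.startswith("Unnamed:")


def flatten_headers(header_rows, separator="_"):
    """Join stacked header rows into one flat list of column names."""
    levels = []
    for depth, row in enumerate(header_rows):
        is_last = depth == len(header_rows) - 1
        filled, last_seen = [], ""
        for cell in row:
            if _is_blank(cell):
                # Upper levels: repeat the parent label to the right.
                # Bottom level: a blank stays blank.
                text = "" if is_last else last_seen
            else:
                text = str(cell).strip()
                last_seen = text
            filled.append(text)
        levels.append(filled)
    # Join the non-empty parts of each column from top to bottom.
    return [separator.join(part for part in parts if part) for parts in zip(*levels)]


header_rows = [
    ["응시자 정보", "Unnamed: 1", "Unnamed: 2"],
    ["응시코드", "성명", "부서"],
]
flat = flatten_headers(header_rows)
print(flat)
```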
Encoding Issues
Auto-detects encoding for CSV files:
- UTF-8 (with/without BOM)
- CP949 (Korean Windows)
- EUC-KR (Korean legacy)
- GBK/GB2312 (Chinese)
- Shift_JIS/EUC-JP (Japanese)
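One common way to approximate this kind of detection is to try a fixed list of candidate encodings and keep the first one that decodes without error. The sketch below uses only the standard library; the function name guess_encoding and the candidate order are assumptions, and the skill itself may detect encodings differently.

```python
# Candidate order is an assumption: UTF-8 first (utf-8-sig also accepts
# BOM-less UTF-8), then the legacy CJK encodings.
CANDIDATE_ENCODINGS = ["utf-8-sig", "cp949", "gbk", "shift_jis", "euc_jp"]


def guess_encoding(raw, candidates=CANDIDATE_ENCODINGS):
    """Return the first candidate that decodes the bytes cleanly, else None."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


utf8_bytes = "이름,점수\n".encode("utf-8")
bom_bytes = b"\xef\xbb\xbf" + utf8_bytes
cp949_bytes = "한글".encode("cp949")

print(guess_encoding(utf8_bytes), guess_encoding(bom_bytes), guess_encoding(cp949_bytes))
```

A first-match heuristic like this can misidentify legacy encodings that share byte ranges, so the order of the candidate list matters.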
Empty Rows/Columns
Removes rows and columns where all values are NaN.
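In pandas terms, this behavior corresponds to dropna with how="all" on both axes. The snippet below is a small sketch of that idea on made-up sample data, not the skill's own code.

```python
import pandas as pd

# Sample data: row 1 is entirely empty, and column "b" is entirely empty.
df = pd.DataFrame(
    {
        "a": [1.0, None, 3.0],
        "b": [None, None, None],
        "c": ["x", None, "z"],
    }
)

# how="all" drops a row or column only when every value in it is missing.
cleaned = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
print(cleaned)
```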
Data Type Inference
Converts string columns to appropriate types:
- Numeric strings → float64/int64
- Date strings → datetime64
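A conservative version of this inference can be sketched with pd.to_numeric and pd.to_datetime: a column is converted only when every non-null value parses. The function name infer_column is hypothetical, and the acceptance rule is an assumption rather than the skill's documented logic.

```python
import pandas as pd
from pandas.api import types as ptypes


def infer_column(series):
    """Convert a text column to numeric or datetime if every non-null value parses."""
    if ptypes.is_numeric_dtype(series) or ptypes.is_datetime64_any_dtype(series):
        return series  # already typed, nothing to do
    non_null = series.notna()
    numeric = pd.to_numeric(series, errors="coerce")
    if numeric[non_null].notna().all():
        return numeric
    # Non-date text may trigger a format-inference warning here; it is coerced to NaT.
    dates = pd.to_datetime(series, errors="coerce")
    if dates[non_null].notna().all():
        return dates
    return series  # leave the column as text


df = pd.DataFrame(
    {
        "score": ["90.5", "88", None],
        "count": ["1", "2", "3"],
        "when": ["2024-12-17", "2024-12-18", "2024-12-19"],
        "name": ["alpha", "beta", "gamma"],
    }
)
typed = pd.DataFrame({column: infer_column(df[column]) for column in df.columns})
print(typed.dtypes)
```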
Duplicate Columns
Renames duplicates with suffixes:
['score', 'score', 'score'] → ['score', 'score_1', 'score_2']
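The renaming scheme above can be reproduced with a simple counter per name. The helper below is a hypothetical sketch; it does not guard against a generated name such as score_1 colliding with a column that already exists.

```python
def dedupe_columns(names):
    """Append _1, _2, ... to repeated column names, keeping the first as-is."""
    seen = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return result


renamed = dedupe_columns(["score", "score", "score"])
print(renamed)
```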
Unicode Path Issues
Handles Korean/CJK filenames with different Unicode normalizations (NFC/NFD).
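One way to cope with NFC/NFD mismatches is to compare normalized names against the directory listing when a direct lookup fails. The sketch below uses only the standard library; resolve_unicode_path is a hypothetical helper and may differ from what the skill does internally.

```python
import tempfile
import unicodedata
from pathlib import Path


def resolve_unicode_path(path):
    """Find a file whose name matches after NFC normalization, or return None."""
    candidate = Path(path)
    if candidate.exists():
        return candidate
    target = unicodedata.normalize("NFC", candidate.name)
    parent = candidate.parent
    if parent.is_dir():
        for entry in parent.iterdir():
            if unicodedata.normalize("NFC", entry.name) == target:
                return entry
    return None


nfc_name = unicodedata.normalize("NFC", "데이터.csv")
nfd_name = unicodedata.normalize("NFD", "데이터.csv")

with tempfile.TemporaryDirectory() as tmp:
    # Create the file under its decomposed (NFD) name, then look it up by the NFC name.
    (Path(tmp) / nfd_name).write_text("a,b\n1,2\n", encoding="utf-8")
    found = resolve_unicode_path(Path(tmp) / nfc_name)
    found_name = found.name if found is not None else None

print(found_name is not None)
```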
Supported Formats
| Extension | Format | Notes |
|---|---|---|
| .xlsx | Excel | Modern Excel format |
| .xls | Excel | Legacy Excel format |
| .csv | CSV | Auto-detects encoding |
| .tsv | TSV | Tab-separated values |
| .ods | ODS | OpenDocument Spreadsheet |
| .parquet | Parquet | Columnar format |
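For orientation, the sketch below shows one plausible way to map these formats onto pandas readers by file extension. The extensions, the READERS table, and the pick_reader helper are assumptions for illustration; the pandas function names and engine names are real, but the skill's own dispatch may differ.

```python
from pathlib import Path

# Extension -> (pandas reader function name, keyword arguments).
READERS = {
    ".xlsx": ("read_excel", {"engine": "openpyxl"}),
    ".xls": ("read_excel", {"engine": "xlrd"}),
    ".ods": ("read_excel", {"engine": "odf"}),
    ".csv": ("read_csv", {}),
    ".tsv": ("read_csv", {"sep": "\t"}),
    ".parquet": ("read_parquet", {"engine": "pyarrow"}),
}


def pick_reader(path):
    """Return the pandas reader name and kwargs for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return READERS[suffix]


reader_name, reader_kwargs = pick_reader("report.XLSX")
print(reader_name, reader_kwargs)
# With pandas installed, the call would be: getattr(pd, reader_name)(path, **reader_kwargs)
```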
Dependencies
Ensure these packages are installed:
uv pip install openpyxl xlrd odfpy pyarrow
Integration with Deep Insight
To integrate with the coder agent, replace standard pandas read:
    # Instead of:
    import pandas as pd
    df = pd.read_excel('data.xlsx')

    # Use:
    import sys
    sys.path.insert(0, 'skills/data-structure-checker/scripts')
    from checker import smart_read
    df = smart_read('data.xlsx')
Troubleshooting
File not found with Korean filename
The skill handles Unicode normalization automatically. Verify the file path is correct.
Unexpected column names
Check the report's header_rows field. To specify header rows explicitly:
    import sys
    sys.path.insert(0, 'skills/data-structure-checker/scripts')
    from reader import read_multi_level

    df = read_multi_level('data.xlsx', header_rows=[0, 1])
Preserve original types
To skip type inference, create DataStructureChecker with infer_types=False.