Claude-skill-registry Data Cleaner
Use this skill when the user needs to analyze, clean, or prepare datasets. Helps with listing columns, detecting data types (text, categorical, ordinal, numeric), identifying data quality issues, and cleaning values that don't fit expected patterns. Invoke when users mention data cleaning, data quality, column analysis, type detection, or preparing datasets.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-cleaner" ~/.claude/skills/majiayu000-claude-skill-registry-data-cleaner && rm -rf "$T"
skills/data/data-cleaner/SKILL.md
Data Cleaning Skill
This skill helps analyze and clean datasets by detecting data types, identifying quality issues, and suggesting or applying corrections.
Core Capabilities
- Sample Rows: View random sample of rows from a dataset using pandas.sample()
- Column Analysis: List all columns with basic statistics and sample values
- Type Detection: Automatically detect whether each column is:
  - Numeric (integer, float)
  - Categorical (limited unique values)
  - Ordinal (ordered categories)
  - Text (free-form text)
  - DateTime
  - Boolean
- Data Quality Reports: Comprehensive quality analysis with severity levels and completeness scores
- Value Mapping Generation: Auto-generate standardization functions for categorical data
- Value Cleaning: Fix common issues like extra whitespace, inconsistent casing, invalid values
- Validation Reports: Compare before/after cleaning to verify transformations
Instructions
When the user requests data cleaning assistance:
- Identify the dataset: Ask for the file path if not provided
- Generate quality report: Use scripts/data_quality_report.py for comprehensive quality analysis
- Analyze columns: Use scripts/analyze_columns.py to get an overview of all columns
- Detect types: Use scripts/detect_types.py to determine the data type of each column
- Generate value mappings: Use scripts/value_mapping_generator.py for categorical columns needing standardization
- Present findings: Show the user:
  - Data quality grade and issues
  - Column names and detected types
  - Suggested value mappings
  - Sample problematic values
- Suggest fixes: Recommend cleaning strategies based on issues found
- Apply cleaning: If the user approves, use scripts/clean_values.py to fix issues
- Validate results: Use scripts/validation_report.py to compare before/after and confirm changes
Supported File Formats
All scripts support multiple file formats with automatic detection:
- CSV (.csv) - Comma-separated values
- Excel (.xlsx, .xls) - Microsoft Excel files
- JSON (.json) - JSON arrays or objects
- JSONL (.jsonl) - JSON Lines format (one JSON object per line)
File format is auto-detected from the extension. You can also specify the format explicitly with the --format parameter (clean_values.py uses --input-format and --output-format instead).
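The extension-based detection described above can be sketched as follows. This is a minimal illustration, not the scripts' actual code: the `EXTENSION_TO_FORMAT` table and the `resolve_format` helper are hypothetical names, and the mapping simply mirrors the extensions listed in this section.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table; it mirrors the extensions listed in this section.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".xlsx": "excel",
    ".xls": "excel",
    ".json": "json",
    ".jsonl": "jsonl",
}


def resolve_format(file_path: str, explicit: Optional[str] = None) -> str:
    """Return the explicit format if one was given, else infer it from the extension."""
    if explicit is not None:
        return explicit
    suffix = Path(file_path).suffix.lower()
    if suffix not in EXTENSION_TO_FORMAT:
        raise ValueError(f"Cannot infer format from extension {suffix!r}; pass --format")
    return EXTENSION_TO_FORMAT[suffix]


print(resolve_format("data.CSV"))
print(resolve_format("export.dat", explicit="jsonl"))
```

An explicit format takes precedence in this sketch, which is one plausible way for a --format flag to override detection when a file has an unusual extension.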
Using the Python Scripts
All scripts should be run using uv for fast, dependency-managed execution:
uv run python scripts/<script_name>.py [arguments]
Or use uvx to run scripts with automatic dependency installation:
uvx --from . python scripts/<script_name>.py [arguments]
Note: The first time you run a script, uv will automatically install the required dependencies (pandas, numpy, openpyxl) in an isolated environment. Subsequent runs will be much faster.
sample_rows.py
View a random sample of rows from a dataset.
Usage:
uv run python scripts/sample_rows.py <file_path> [--n 10] [--format csv|excel|json|jsonl] [--output table|json|csv] [--seed SEED]
Options:
--n: Number of rows to sample (default: 10)
--format: Input file format - csv, excel, json, or jsonl (auto-detected if not specified)
--output: Output format - table (human-readable), json, or csv (default: table)
--seed: Random seed for reproducibility (optional)
Output: Random sample of rows in the specified format
Examples:
# Sample 5 random rows from CSV
python scripts/sample_rows.py data.csv --n 5
# Sample from JSON Lines file
python scripts/sample_rows.py data.jsonl --n 10
# Sample with reproducible results
python scripts/sample_rows.py data.csv --n 10 --seed 42
# Output as JSON
python scripts/sample_rows.py data.xlsx --n 20 --output json
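To illustrate what a seed buys you, here is a small standard-library sketch of seeded row sampling. The script itself is described as using pandas.sample(); this stand-in uses `random.Random` so it runs without dependencies, and the `sample_rows` function name and inline data are illustrative assumptions.

```python
import csv
import io
import random


def sample_rows(rows, n=10, seed=None):
    """Return up to n rows chosen at random; a fixed seed gives a repeatable sample."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


# Inline CSV text stands in for a real file.
raw = "id,name\n1,Ann\n2,Bob\n3,Cy\n4,Di\n"
rows = list(csv.DictReader(io.StringIO(raw)))

first = sample_rows(rows, n=2, seed=42)
second = sample_rows(rows, n=2, seed=42)
print(first == second)  # True: the same seed yields the same sample
```

Capping `n` at the number of rows avoids an error on small files; whether the real script does the same is not stated here.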
analyze_columns.py
Analyzes all columns in a dataset and provides summary statistics.
Usage:
uv run python scripts/analyze_columns.py <file_path> [--format csv|excel|json|jsonl]
Output: JSON with column names, types, null counts, unique counts, and sample values
Examples:
# Analyze CSV file
python scripts/analyze_columns.py customers.csv
# Analyze JSONL file
python scripts/analyze_columns.py events.jsonl
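The kind of per-column summary described in the Output line can be sketched in plain Python. This is an assumption-laden illustration: the key names (`null_count`, `unique_count`, `sample_values`) and the choice to treat empty strings as missing are hypothetical, not taken from the script.

```python
import json


def analyze_columns(records, sample_size=3):
    """Summarise each column: null count, unique count, and a few sample values."""
    # Collect column names in first-seen order.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)

    summary = {}
    for col in columns:
        values = [rec.get(col) for rec in records]
        # Assumption: None and empty strings both count as missing.
        present = [v for v in values if v is not None and v != ""]
        summary[col] = {
            "null_count": len(values) - len(present),
            "unique_count": len(set(present)),
            "sample_values": present[:sample_size],
        }
    return summary


records = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": None},
    {"name": "Ann", "age": 45},
]
print(json.dumps(analyze_columns(records), indent=2))
```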
detect_types.py
Detects the semantic type of each column (numeric, categorical, ordinal, text, datetime).
Usage:
uv run python scripts/detect_types.py <file_path> [--format csv|excel|json|jsonl]
Output: JSON mapping columns to detected types with confidence scores
Examples:
# Detect types in CSV
python scripts/detect_types.py data.csv
# Detect types in JSON file
python scripts/detect_types.py data.json
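One way to think about type detection with confidence scores is as a share-of-values heuristic. The sketch below is not the algorithm in detect_types.py: the 0.9 cut-off, the single `%Y-%m-%d` date format, the confidence formulas, and the reuse of 20 as the categorical threshold are all assumptions made for illustration, and ordinal and boolean detection are left out.

```python
from datetime import datetime


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_date(text):
    # Assumption: only ISO-style dates (YYYY-MM-DD) are recognised.
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def detect_type(values, categorical_threshold=20):
    """Guess a column type and report how strongly the values support the guess."""
    present = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not present:
        return {"type": "text", "confidence": 0.0}
    for name, check in (("numeric", _is_number), ("datetime", _is_date)):
        share = sum(check(v) for v in present) / len(present)
        if share >= 0.9:  # hypothetical cut-off
            return {"type": name, "confidence": round(share, 2)}
    unique = len(set(present))
    if unique <= categorical_threshold and unique < len(present):
        # Few distinct values that repeat: treat as categorical.
        return {"type": "categorical", "confidence": round(1 - unique / len(present), 2)}
    return {"type": "text", "confidence": round(unique / len(present), 2)}


print(detect_type(["1", "2.5", "3"]))
print(detect_type(["red", "blue", "red", "red"]))
print(detect_type(["2024-01-05", "2024-02-10"]))
```

Explaining a detection to the user then amounts to reporting the share of values that matched, which fits the best practice of explaining why a type was chosen.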
clean_values.py
Cleans specific columns based on detected issues.
Usage:
uv run python scripts/clean_values.py <input_file> <output_file> [--operations json_string] [--input-format csv|excel|json|jsonl] [--output-format csv|excel|json|jsonl]
Options:
--operations: JSON string defining cleaning operations
--input-format: Input file format (auto-detected if not specified)
--output-format: Output file format (auto-detected if not specified)
Operations JSON format:
{ "column_name": { "operation": "trim|lowercase|uppercase|remove_special|fill_missing|convert_type", "params": {} } }
Examples:
# Clean CSV file and output as CSV
python scripts/clean_values.py data.csv cleaned.csv --operations '{"name":{"operation":"trim"}}'
# Clean JSONL file and convert to JSON
python scripts/clean_values.py logs.jsonl cleaned.json --input-format jsonl --output-format json --operations '{"status":{"operation":"lowercase"}}'
# Clean JSON and output as JSONL
python scripts/clean_values.py data.json output.jsonl --output-format jsonl
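To make the operations JSON concrete, here is a minimal sketch of how such a spec could be applied to a list of records. It covers only the three operations that need no params (trim, lowercase, uppercase); the `OPERATIONS` table and `apply_operations` function are hypothetical names, and the real script may behave differently, for example on non-string values.

```python
import json

# Only operations that need no "params" are sketched here.
OPERATIONS = {
    "trim": lambda v: v.strip(),
    "lowercase": lambda v: v.lower(),
    "uppercase": lambda v: v.upper(),
}


def apply_operations(records, operations_json):
    """Apply one string operation per column, leaving non-string values untouched."""
    spec = json.loads(operations_json)
    cleaned = []
    for rec in records:
        row = dict(rec)  # copy so the input records are not modified
        for column, rule in spec.items():
            func = OPERATIONS[rule["operation"]]
            if isinstance(row.get(column), str):
                row[column] = func(row[column])
        cleaned.append(row)
    return cleaned


records = [{"name": "  Ann ", "status": "ACTIVE"}, {"name": "Bob", "status": None}]
ops = '{"name": {"operation": "trim"}, "status": {"operation": "lowercase"}}'
print(apply_operations(records, ops))
```

Copying each row before changing it keeps the original data intact, in line with the advice to save cleaned output to a new file.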
data_quality_report.py
Generates a comprehensive data quality report with severity levels and completeness scores.
Usage:
uv run python scripts/data_quality_report.py <file_path> [--format csv|excel|json|jsonl] [--output report.json]
Output: JSON report with:
- Overall quality grade (A-F)
- Per-column completeness scores
- Missing values analysis
- Formatting issues
- Outliers detection
- Data type consistency checks
Examples:
# Generate quality report for CSV
python scripts/data_quality_report.py data.csv --output report.json
# Generate report for JSONL file
python scripts/data_quality_report.py logs.jsonl
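A completeness score and letter grade could be computed along these lines. The grade cut-offs below are invented for illustration, as is the choice to derive the overall grade from the mean completeness alone; the actual report also weighs formatting issues, outliers, and type consistency, and its scoring rules are not given here.

```python
def completeness(values):
    """Fraction of values that are neither None nor blank."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v is not None and str(v).strip() != "")
    return filled / len(values)


# Hypothetical grade cut-offs, highest first.
GRADE_CUTOFFS = [(0.95, "A"), (0.85, "B"), (0.70, "C"), (0.50, "D")]


def grade(score):
    """Map a 0..1 score to a letter grade using the cut-offs above."""
    for cutoff, letter in GRADE_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


columns = {"name": ["Ann", "Bob", "", "Di"], "age": [31, None, None, 40]}
scores = {col: completeness(vals) for col, vals in columns.items()}
overall = sum(scores.values()) / len(scores)
print(scores)          # {'name': 0.75, 'age': 0.5}
print(grade(overall))  # D (mean completeness is 0.625)
```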
value_mapping_generator.py
Auto-generates standardization mappings and Python functions for categorical columns.
Usage:
uv run python scripts/value_mapping_generator.py <file_path> [--column COLUMN] [--threshold 20] [--format csv|excel|json|jsonl] [--output-functions functions.py]
Output: JSON with:
- Suggested value mappings
- Groups of similar values
- Auto-generated Python standardization functions
- Before/after value counts
Options:
--column: Analyze specific column only
--threshold: Max unique values to consider categorical (default: 20)
--format: File format - csv, excel, json, or jsonl (auto-detected if not specified)
--output-functions: Write Python functions to file
Examples:
# Generate mappings for all categorical columns
python scripts/value_mapping_generator.py survey.csv
# Generate mappings for specific column in JSONL file
python scripts/value_mapping_generator.py events.jsonl --column user_type
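As a rough picture of what a suggested value mapping looks like, the sketch below groups values that are identical after trimming and lowercasing, then maps each variant to the most frequent spelling in its group. The grouping rule and the `suggest_mapping` and `standardize` names are assumptions; the generator may use a different similarity measure.

```python
from collections import Counter, defaultdict


def suggest_mapping(values):
    """Map each variant spelling to the most common spelling in its group."""
    groups = defaultdict(Counter)
    for value in values:
        # Values that match after trimming and lowercasing share a group.
        groups[value.strip().lower()][value] += 1

    mapping = {}
    for variants in groups.values():
        canonical = variants.most_common(1)[0][0].strip()
        for variant in variants:
            if variant != canonical:
                mapping[variant] = canonical
    return mapping


def standardize(value, mapping):
    """Return the canonical spelling, or the value itself if no mapping applies."""
    return mapping.get(value, value)


values = ["Yes", "yes", "YES ", "Yes", "No", "no", "No"]
mapping = suggest_mapping(values)
print(mapping)
print(Counter(standardize(v, mapping) for v in values))
```

The final `Counter` gives the after-cleaning value counts, which can be set beside `Counter(values)` for a before/after comparison.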
validation_report.py
Compares original and cleaned datasets to validate transformations.
Usage:
uv run python scripts/validation_report.py <original_file> <cleaned_file> [--format csv|excel|json|jsonl] [--output validation.json]
Output: JSON report with:
- Transformation examples for each column
- Data loss analysis
- Before/after distribution comparisons
- Validation status (pass/review_needed)
- Recommendations
Examples:
# Validate CSV cleaning
python scripts/validation_report.py original.csv cleaned.csv
# Validate JSONL cleaning
python scripts/validation_report.py original.jsonl cleaned.jsonl --output validation.json
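A before/after comparison of this kind might look like the sketch below. It assumes rows stay in the same order, treats any dropped row as data loss, and flags the result as review_needed when the loss exceeds a hypothetical `max_loss` ratio; the report keys other than the pass/review_needed status labels are invented for illustration.

```python
def validate(original, cleaned, max_loss=0.0):
    """Compare row counts and per-column changes between two lists of records."""
    report = {"rows_before": len(original), "rows_after": len(cleaned), "changed": {}}

    # Assumption: rows are aligned by position in both datasets.
    for before, after in zip(original, cleaned):
        for column, old in before.items():
            new = after.get(column)
            if new != old:
                examples = report["changed"].setdefault(column, [])
                if len(examples) < 3:  # keep a few transformation examples per column
                    examples.append((old, new))

    lost = len(original) - len(cleaned)
    loss_ratio = lost / len(original) if original else 0.0
    report["status"] = "pass" if loss_ratio <= max_loss else "review_needed"
    return report


original = [{"name": " Ann "}, {"name": "Bob"}]
cleaned = [{"name": "Ann"}, {"name": "Bob"}]
print(validate(original, cleaned))
print(validate(original, cleaned[:1])["status"])
```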
Workflow Examples
Basic Workflow
- User: "I need to clean my customer data"
- Get file path from user
- Run sample_rows.py to show user a preview of their data
- Run data_quality_report.py to assess overall quality
- Run analyze_columns.py to see all columns
- Run detect_types.py to determine types
- Present findings and ask user which columns to clean
- Run clean_values.py with appropriate operations
- Run validation_report.py to verify changes
- Confirm cleaning completed and show summary
Advanced Workflow (with auto-generated functions)
- User: "Generate cleaning functions for my survey data"
- Run data_quality_report.py for quality overview
- Run value_mapping_generator.py for categorical columns
- Show user the generated standardization functions
- User can copy functions into their own cleaning script
- Apply cleaning using the generated functions
- Validate with validation_report.py
Best Practices
- Always show sample values before suggesting changes
- Explain why certain types were detected
- Ask for confirmation before modifying data
- Create backups or save to new files when cleaning
- Support all file formats: CSV, Excel, JSON, and JSONL
- For JSON/JSONL files, pandas expects records-oriented format (list of objects)
- JSONL format is ideal for streaming or large datasets (one JSON object per line)
- You can convert between formats using clean_values.py with --input-format and --output-format
- Provide clear summaries of changes made
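The difference between a records-oriented JSON array and JSONL can be shown with a short round trip. This sketch uses only the standard library, and the `read_jsonl` and `write_jsonl` helpers are illustrative names rather than functions from the scripts.

```python
import io
import json


def read_jsonl(stream):
    """Yield one record per non-empty line, so the whole file is never held in memory."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def write_jsonl(records, stream):
    """Write each record as one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


# A records-oriented JSON document is a list of objects.
records = json.loads('[{"id": 1, "status": "ok"}, {"id": 2, "status": "fail"}]')

buffer = io.StringIO()  # stands in for a file opened in text mode
write_jsonl(records, buffer)
buffer.seek(0)
round_trip = list(read_jsonl(buffer))
print(round_trip == records)  # True
```

Because `read_jsonl` is a generator, records can be processed one at a time, which is why JSONL suits streaming and large datasets.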