Claude-skill-registry csv-data-auditor
Validate and audit CSV data for quality, consistency, and completeness. Use when you need to check CSV files for data issues, missing values, or format inconsistencies.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/csv-data-auditor" ~/.claude/skills/majiayu000-claude-skill-registry-csv-data-auditor && rm -rf "$T"
manifest:
skills/data/csv-data-auditor/SKILL.mdsource content
CSV Data Auditor
Instructions
When auditing CSV data, perform these validation checks systematically:
1. File Structure Validation
- Verify the file exists and is readable
- Check if the file is a valid CSV format
- Ensure consistent delimiter usage
- Verify proper quoting and escaping
- Check for balanced quotes
2. Header Analysis
- Confirm headers exist (unless headerless CSV)
- Check for duplicate header names
- Validate header naming conventions
- Ensure headers are descriptive and consistent
- Check for special characters in headers
3. Row Consistency
- Count total rows and columns
- Verify all rows have the same number of columns
- Check for empty rows or rows with only whitespace
- Identify truncated rows
- Detect corrupted or malformed rows
4. Data Type Validation
For each column, validate the expected data type:
Numeric Fields
- Check for non-numeric values
- Validate decimal point usage
- Look for negative values where inappropriate
- Check for extremely large/small values
Date/Time Fields
- Verify date format consistency (ISO 8601, US, EU, etc.)
- Check for invalid dates (e.g., February 30th)
- Validate time zone handling
- Look for future dates where inappropriate
Text Fields
- Check for encoding issues
- Look for unexpected special characters
- Verify string length constraints
- Check for leading/trailing whitespace
Boolean Fields
- Validate true/false representations
- Check for 1/0, Y/N, Yes/No consistency
- Look for ambiguous values
5. Data Quality Checks
Missing Values
- Count NULL/empty values per column
- Calculate missing value percentages
- Identify patterns in missing data
- Check for placeholders like "N/A", "null", "-"
Duplicates
- Find exact duplicate rows
- Check for duplicate IDs or keys
- Identify near-duplicates (typos, variations)
- Validate uniqueness constraints
Outliers
- Identify statistical outliers in numeric columns
- Check for values outside expected ranges
- Look for anomalies in categorical data
- Validate business logic constraints
Consistency Checks
- Cross-validate related columns
- Check referential integrity
- Validate business rules
- Look for logical contradictions
6. Validation Script (Python)
Create a Python script to perform automated checks:
import pandas as pd import numpy as np from datetime import datetime def audit_csv(file_path, delimiter=','): """Perform comprehensive CSV audit""" # Load CSV try: df = pd.read_csv(file_path, delimiter=delimiter) except Exception as e: return {"error": f"Failed to load CSV: {str(e)}"} audit_report = { "file_info": { "rows": len(df), "columns": len(df.columns), "headers": list(df.columns) }, "issues": [] } # Check for missing values missing_counts = df.isnull().sum() for col, count in missing_counts.items(): if count > 0: percentage = (count / len(df)) * 100 audit_report["issues"].append({ "type": "missing_values", "column": col, "count": count, "percentage": round(percentage, 2) }) # Check for duplicates duplicate_rows = df.duplicated().sum() if duplicate_rows > 0: audit_report["issues"].append({ "type": "duplicates", "count": duplicate_rows }) # Data type validation for col in df.columns: # Check for mixed types in object columns if df[col].dtype == 'object': sample_values = df[col].dropna().head(10) # Add specific type checks based on column name patterns return audit_report
7. Report Format
Generate a structured audit report:
# CSV Audit Report ## File Information - File: data.csv - Size: 15.2 MB - Rows: 50,000 - Columns: 12 - Headers: id, name, email, age, join_date, salary, department, ... ## Issues Found ### Critical Issues 1. **Missing Values**: - `email`: 2,500 missing (5.0%) - `salary`: 150 missing (0.3%) ### Warnings 1. **Inconsistent Date Format**: - `join_date`: Mix of ISO and US formats detected - Examples: 2023-01-15, 01/15/2023, 15-Jan-2023 2. **Potential Outliers**: - `age`: Values 0 and 150 detected - `salary`: Extremely high values > $1M ### Recommendations 1. Clean up email field - contact data source 2. Standardize date format to ISO 8601 3. Validate age and salary ranges 4. Remove or investigate duplicate rows ## Summary - Overall Quality: Good - Issues to Fix: 3 critical, 5 warnings - Estimated Fix Time: 2-3 hours
8. Common Issues and Solutions
Encoding Problems
- Try different encodings (UTF-8, Latin-1, Windows-1252)
- Use
library to detect encodingchardet - Handle BOM (Byte Order Mark) if present
Large Files
- Process in chunks for memory efficiency
- Use Dask for out-of-core processing
- Sample data for quick validation
Complex CSVs
- Handle quoted fields with embedded delimiters
- Process multi-line records carefully
- Validate escape character usage
Usage
- Provide the CSV file path
- Specify delimiter if not comma
- Configure validation rules based on your data
- Review the generated audit report
- Address issues systematically
Tips
- Always keep a backup of original data
- Document any data transformations
- Create validation rules based on business requirements
- Consider using schema validation tools like Great Expectations
- Automate regular audits for production data