Claude-skill-registry data-cleaning-standards

Clean and standardize vehicle insurance CSV/Excel data. Use when handling missing values, fixing data formats, removing duplicates, or standardizing fields. Mentions "clean data", "handle nulls", "standardize", "duplicates", or "normalize".

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-cleaning-standards" ~/.claude/skills/majiayu000-claude-skill-registry-data-cleaning-standards && rm -rf "$T"
manifest: skills/data/data-cleaning-standards/SKILL.md
source content

Data Cleaning Standards

Clean and standardize vehicle insurance data following established business rules.

When to Activate

Use this skill when the user:

  • Says "clean the data" or "standardize data"
  • Mentions "missing values", "null handling", or "fill missing"
  • Asks "remove duplicates" or "deduplicate"
  • Wants to "normalize dates" or "standardize formats"
  • Mentions data preparation or preprocessing

Quick Start Workflow

Step 1: Handle Missing Values
  ↓
Step 2: Remove Duplicates
  ↓
Step 3: Standardize Formats
  ↓
Step 4: Handle Outliers

Step 1: Handle Missing Values

1.1 Missing Value Strategy by Field Type

Field TypeStrategyFill ValueWhy
三级机构Lookup from mappingFrom mapping tableAuthoritative source
团队简称Keep null
''
(empty string)
Optional field
签单/批改保费Delete rowN/ACritical metric
手续费含税Fill zero
0
Legitimate zero commission
是否续保Keep null
''
Display as "Unknown"
String fieldsFill empty
''
Avoid None errors
Numeric fieldsDelete or 0Depends on fieldCase by case

1.2 Implementation

def handle_missing_values(df, staff_mapping):
    """Apply missing value strategy"""

    # 1. 三级机构 - lookup from mapping
    if '三级机构' in df.columns:
        for idx in df[df['三级机构'].isnull()].index:
            staff = df.at[idx, '业务员']
            mapped_info = lookup_staff_info(staff, staff_mapping)
            if mapped_info:
                df.at[idx, '三级机构'] = mapped_info['三级机构']

    # 2. 手续费 - fill zero
    if '手续费含税' in df.columns:
        df['手续费含税'] = df['手续费含税'].fillna(0)

    # 3. 签单保费 - delete missing rows
    before = len(df)
    df = df[df['签单/批改保费'].notnull()]
    after = len(df)
    if before > after:
        print(f"Deleted {before - after} rows with missing premium")

    # 4. String fields - fill empty
    string_cols = df.select_dtypes(include=['object']).columns
    df[string_cols] = df[string_cols].fillna('')

    return df

Step 2: Remove Duplicates

2.1 Deduplication Rules

Composite Key:

保单号
+
投保确认时间

Keep Strategy:

keep='last'
(most recent record)

Why: Same policy may have multiple updates (批改)

2.2 Implementation

def remove_duplicates(df):
    """Remove duplicate records"""

    # Ensure date is datetime
    df['投保确认时间'] = pd.to_datetime(df['投保确认时间'], errors='coerce')

    # Use duplicated() to avoid type issues
    before = len(df)
    dup_mask = df.duplicated(subset=['保单号', '投保确认时间'], keep='last')
    df = df[~dup_mask]
    after = len(df)

    if before > after:
        print(f"Removed {before - after} duplicates")

    return df

Step 3: Standardize Formats

3.1 Date Standardization

Target Format:

datetime64[ns]

def standardize_dates(df):
    """Convert all date fields to datetime"""
    date_cols = ['刷新时间', '投保确认时间', '保险起期']

    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')
            invalid = df[col].isnull().sum()
            if invalid > 0:
                print(f"⚠️  {col}: {invalid} invalid dates converted to NaT")

    return df

3.2 Numeric Standardization

Target Format:

float64

def standardize_numerics(df):
    """Convert numeric fields to float"""
    numeric_cols = ['签单/批改保费', '签单数量', '手续费', '手续费含税', '增值税']

    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')

    return df

3.3 String Standardization

Rules:

  • Strip leading/trailing whitespace
  • Map variations (Y/N → 是/否)
def standardize_strings(df):
    """Clean string fields"""

    # Strip whitespace
    string_cols = df.select_dtypes(include=['object']).columns
    for col in string_cols:
        df[col] = df[col].astype(str).str.strip()

    # Standardize yes/no fields
    yes_no_cols = ['是否续保', '是否新能源', '是否过户车', '是否异地车']
    for col in yes_no_cols:
        if col in df.columns:
            df[col] = df[col].map({
                'Y': '是', 'N': '否',
                'y': '是', 'n': '否',
                '1': '是', '0': '否'
            }).fillna(df[col])

    return df

Step 4: Handle Outliers

4.1 Outlier Detection Rules

Important: NEVER delete negative premiums (legitimate business data)

Outlier TypeDetection RuleAction
Negative premium
< 0
KEEP (refunds/adjustments)
Extreme premium
< -1M or > 100K
⚠️ Flag only
Zero commission
== 0
✅ KEEP (normal)
Negative amount
保额 < 0
⚠️ Flag only

4.2 Implementation

def detect_outliers(df):
    """Detect outliers but DO NOT delete"""
    outliers = {}

    # 1. Extreme premium (flag only)
    if '签单/批改保费' in df.columns:
        extreme = df[
            (df['签单/批改保费'] < -1000000) |
            (df['签单/批改保费'] > 100000)
        ]
        if len(extreme) > 0:
            outliers['extreme_premium'] = {
                'count': len(extreme),
                'samples': extreme['保单号'].head(5).tolist()
            }

    # 2. Negative amounts (flag only)
    if '签单/批改保额' in df.columns:
        negative_amt = df[df['签单/批改保额'] < 0]
        if len(negative_amt) > 0:
            outliers['negative_amount'] = {
                'count': len(negative_amt),
                'samples': negative_amt['保单号'].head(5).tolist()
            }

    # Report outliers without deleting
    if outliers:
        print("⚠️  Detected outliers (kept in data):")
        for key, info in outliers.items():
            print(f"  - {key}: {info['count']} records")

    return df, outliers

Critical Rule: Never filter out negative premiums:

# ❌ WRONG - DO NOT DO THIS
df = df[df['签单/批改保费'] > 0]

# ✅ CORRECT - Keep all values
total_premium = df['签单/批改保费'].sum()  # May be negative

Complete Cleaning Pipeline

All-in-One Function

def clean_data(df, staff_mapping):
    """Complete cleaning pipeline"""

    print("Starting data cleaning pipeline...")
    initial_count = len(df)

    # Step 1: Handle missing values
    df = handle_missing_values(df, staff_mapping)
    print(f"✓ Step 1: Handled missing values")

    # Step 2: Remove duplicates
    df = remove_duplicates(df)
    print(f"✓ Step 2: Removed duplicates")

    # Step 3: Standardize formats
    df = standardize_dates(df)
    df = standardize_numerics(df)
    df = standardize_strings(df)
    print(f"✓ Step 3: Standardized formats")

    # Step 4: Detect outliers (no deletion)
    df, outliers = detect_outliers(df)
    print(f"✓ Step 4: Detected outliers")

    final_count = len(df)
    print(f"\nCleaning complete: {initial_count} → {final_count} records")

    return df, outliers

Common Use Cases

Case 1: "Clean my CSV file"

import pandas as pd
import json

# Load data
df = pd.read_csv('data.csv', encoding='utf-8-sig')
mapping = json.load(open('staff_mapping.json'))

# Run full pipeline
df_clean, outliers = clean_data(df, mapping)

# Save cleaned data
df_clean.to_csv('data_cleaned.csv', index=False, encoding='utf-8-sig')

Case 2: "Handle missing institution fields"

# Focus on Step 1 - missing value handling
df = handle_missing_values(df, staff_mapping)

# Check results
missing_before = df_original['三级机构'].isnull().sum()
missing_after = df['三级机构'].isnull().sum()
print(f"Fixed {missing_before - missing_after} missing institutions")

Case 3: "Remove duplicate policies"

# Focus on Step 2 - deduplication
df_unique = remove_duplicates(df)
print(f"Removed {len(df) - len(df_unique)} duplicates")

Cleaning Checklist

Before cleaning:

  • Backup original data
  • Load staff mapping file
  • Check file encoding (use
    utf-8-sig
    for Excel exports)

During cleaning:

  • Handle missing values (Step 1)
  • Remove duplicates (Step 2)
  • Standardize formats (Step 3)
  • Detect outliers (Step 4)

After cleaning:

  • Verify record count change is reasonable
  • Check critical fields are complete
  • Review outlier report
  • Save cleaned data with new filename

Troubleshooting

"Many records deleted"

Check: Are you accidentally deleting negative premiums?

# Check negative premium count
negative_count = (df['签单/批改保费'] < 0).sum()
print(f"Negative premiums: {negative_count} (should be kept)")

"Date conversion creates many NaT"

Solution: Check date format

# Inspect date column
print(df['投保确认时间'].head())

# Try different format
df['投保确认时间'] = pd.to_datetime(
    df['投保确认时间'],
    format='%Y/%m/%d',  # Adjust format
    errors='coerce'
)

"Duplicates not removed"

Check: Ensure date column is datetime type

print(df['投保确认时间'].dtype)  # Should be datetime64[ns]

Related Files

Data processor: backend/data_processor.py

  • See
    _clean_data()
    method (lines 132-156)
  • See
    merge_with_existing()
    method (lines 158-192)

Field definitions: docs/FIELD_MAPPING.md

Related Skills:

  • field-validation
    - Run this BEFORE cleaning
  • staff-mapping-management
    - Update mapping table

Skill Version: v1.0 Created: 2025-11-09 File Size: ~310 lines Focuses On: Data cleaning only (not validation or mapping)