Data Cleaning Standards

Clean and standardize vehicle insurance CSV/Excel data. Use when handling missing values, fixing data formats, removing duplicates, or standardizing fields. Triggers on mentions of "clean data", "handle nulls", "standardize", "duplicates", or "normalize".

Install the full registry:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

Or install just this skill:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-cleaning-standards" ~/.claude/skills/majiayu000-claude-skill-registry-data-cleaning-standards && rm -rf "$T"
```

Skill path: `skills/data/data-cleaning-standards/SKILL.md`
Clean and standardize vehicle insurance data following established business rules.
When to Activate
Use this skill when the user:
- Says "clean the data" or "standardize data"
- Mentions "missing values", "null handling", or "fill missing"
- Asks "remove duplicates" or "deduplicate"
- Wants to "normalize dates" or "standardize formats"
- Mentions data preparation or preprocessing
Quick Start Workflow
Step 1: Handle Missing Values → Step 2: Remove Duplicates → Step 3: Standardize Formats → Step 4: Handle Outliers
Step 1: Handle Missing Values
1.1 Missing Value Strategy by Field Type
| Field Type | Strategy | Fill Value | Why |
|---|---|---|---|
| 三级机构 | Lookup from mapping | From mapping table | Authoritative source |
| 团队简称 | Keep null | (empty string) | Optional field |
| 签单/批改保费 | Delete row | N/A | Critical metric |
| 手续费含税 | Fill zero | `0` | Legitimate zero commission |
| 是否续保 | Keep null | (null) | Display as "Unknown" |
| String fields | Fill empty | `''` | Avoid None errors |
| Numeric fields | Delete or 0 | Depends on field | Case by case |
1.2 Implementation
```python
def handle_missing_values(df, staff_mapping):
    """Apply missing value strategy"""
    # 1. 三级机构 - lookup from mapping
    if '三级机构' in df.columns:
        for idx in df[df['三级机构'].isnull()].index:
            staff = df.at[idx, '业务员']
            mapped_info = lookup_staff_info(staff, staff_mapping)
            if mapped_info:
                df.at[idx, '三级机构'] = mapped_info['三级机构']

    # 2. 手续费 - fill zero
    if '手续费含税' in df.columns:
        df['手续费含税'] = df['手续费含税'].fillna(0)

    # 3. 签单保费 - delete missing rows
    before = len(df)
    df = df[df['签单/批改保费'].notnull()]
    after = len(df)
    if before > after:
        print(f"Deleted {before - after} rows with missing premium")

    # 4. String fields - fill empty
    string_cols = df.select_dtypes(include=['object']).columns
    df[string_cols] = df[string_cols].fillna('')

    return df
```
Step 2: Remove Duplicates
2.1 Deduplication Rules
- Composite key: `保单号` + `投保确认时间`
- Keep strategy: `keep='last'` (most recent record)
- Why: the same policy may receive multiple updates (批改, endorsements)
2.2 Implementation
```python
def remove_duplicates(df):
    """Remove duplicate records"""
    # Ensure date is datetime
    df['投保确认时间'] = pd.to_datetime(df['投保确认时间'], errors='coerce')

    # Use duplicated() to avoid type issues
    before = len(df)
    dup_mask = df.duplicated(subset=['保单号', '投保确认时间'], keep='last')
    df = df[~dup_mask]
    after = len(df)
    if before > after:
        print(f"Removed {before - after} duplicates")

    return df
```
Step 3: Standardize Formats
3.1 Date Standardization
Target format: `datetime64[ns]`
```python
def standardize_dates(df):
    """Convert all date fields to datetime"""
    date_cols = ['刷新时间', '投保确认时间', '保险起期']
    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')
            invalid = df[col].isnull().sum()
            if invalid > 0:
                print(f"⚠️ {col}: {invalid} invalid dates converted to NaT")
    return df
```
3.2 Numeric Standardization
Target format: `float64`
```python
def standardize_numerics(df):
    """Convert numeric fields to float"""
    numeric_cols = ['签单/批改保费', '签单数量', '手续费', '手续费含税', '增值税']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    return df
```
3.3 String Standardization
Rules:
- Strip leading/trailing whitespace
- Map variations (Y/N → 是/否)
```python
def standardize_strings(df):
    """Clean string fields"""
    # Strip whitespace
    string_cols = df.select_dtypes(include=['object']).columns
    for col in string_cols:
        df[col] = df[col].astype(str).str.strip()

    # Standardize yes/no fields
    yes_no_cols = ['是否续保', '是否新能源', '是否过户车', '是否异地车']
    for col in yes_no_cols:
        if col in df.columns:
            df[col] = df[col].map({
                'Y': '是', 'N': '否',
                'y': '是', 'n': '否',
                '1': '是', '0': '否',
            }).fillna(df[col])

    return df
```
Step 4: Handle Outliers
4.1 Outlier Detection Rules
Important: NEVER delete negative premiums (legitimate business data)
| Outlier Type | Detection Rule | Action |
|---|---|---|
| Negative premium | `签单/批改保费 < 0` | ✅ KEEP (refunds/adjustments) |
| Extreme premium | `签单/批改保费 < -1,000,000` or `> 100,000` | ⚠️ Flag only |
| Zero commission | `手续费含税 == 0` | ✅ KEEP (normal) |
| Negative amount | `签单/批改保额 < 0` | ⚠️ Flag only |
4.2 Implementation
```python
def detect_outliers(df):
    """Detect outliers but DO NOT delete"""
    outliers = {}

    # 1. Extreme premium (flag only)
    if '签单/批改保费' in df.columns:
        extreme = df[
            (df['签单/批改保费'] < -1000000) |
            (df['签单/批改保费'] > 100000)
        ]
        if len(extreme) > 0:
            outliers['extreme_premium'] = {
                'count': len(extreme),
                'samples': extreme['保单号'].head(5).tolist(),
            }

    # 2. Negative amounts (flag only)
    if '签单/批改保额' in df.columns:
        negative_amt = df[df['签单/批改保额'] < 0]
        if len(negative_amt) > 0:
            outliers['negative_amount'] = {
                'count': len(negative_amt),
                'samples': negative_amt['保单号'].head(5).tolist(),
            }

    # Report outliers without deleting
    if outliers:
        print("⚠️ Detected outliers (kept in data):")
        for key, info in outliers.items():
            print(f"  - {key}: {info['count']} records")

    return df, outliers
```
Critical Rule: Never filter out negative premiums:
```python
# ❌ WRONG - DO NOT DO THIS
df = df[df['签单/批改保费'] > 0]

# ✅ CORRECT - Keep all values
total_premium = df['签单/批改保费'].sum()  # May be negative
```
Complete Cleaning Pipeline
All-in-One Function
```python
def clean_data(df, staff_mapping):
    """Complete cleaning pipeline"""
    print("Starting data cleaning pipeline...")
    initial_count = len(df)

    # Step 1: Handle missing values
    df = handle_missing_values(df, staff_mapping)
    print("✓ Step 1: Handled missing values")

    # Step 2: Remove duplicates
    df = remove_duplicates(df)
    print("✓ Step 2: Removed duplicates")

    # Step 3: Standardize formats
    df = standardize_dates(df)
    df = standardize_numerics(df)
    df = standardize_strings(df)
    print("✓ Step 3: Standardized formats")

    # Step 4: Detect outliers (no deletion)
    df, outliers = detect_outliers(df)
    print("✓ Step 4: Detected outliers")

    final_count = len(df)
    print(f"\nCleaning complete: {initial_count} → {final_count} records")
    return df, outliers
```
Common Use Cases
Case 1: "Clean my CSV file"
```python
import pandas as pd
import json

# Load data
df = pd.read_csv('data.csv', encoding='utf-8-sig')
with open('staff_mapping.json', encoding='utf-8') as f:
    mapping = json.load(f)

# Run full pipeline
df_clean, outliers = clean_data(df, mapping)

# Save cleaned data
df_clean.to_csv('data_cleaned.csv', index=False, encoding='utf-8-sig')
```
Case 2: "Handle missing institution fields"
```python
# Focus on Step 1 - missing value handling
df_original = df.copy()  # keep a copy to measure the improvement
df = handle_missing_values(df, staff_mapping)

# Check results
missing_before = df_original['三级机构'].isnull().sum()
missing_after = df['三级机构'].isnull().sum()
print(f"Fixed {missing_before - missing_after} missing institutions")
```
Case 3: "Remove duplicate policies"
```python
# Focus on Step 2 - deduplication
df_unique = remove_duplicates(df)
print(f"Removed {len(df) - len(df_unique)} duplicates")
```
Cleaning Checklist
Before cleaning:
- Backup original data
- Load staff mapping file
- Check file encoding (use `utf-8-sig` for Excel exports)
During cleaning:
- Handle missing values (Step 1)
- Remove duplicates (Step 2)
- Standardize formats (Step 3)
- Detect outliers (Step 4)
After cleaning:
- Verify record count change is reasonable
- Check critical fields are complete
- Review outlier report
- Save cleaned data with new filename
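The post-cleaning checks in the checklist can be sketched as code. This is a minimal sketch: the sample frame, counts, and the 5% drop threshold are illustrative assumptions, not part of the skill; only the column names follow this document.

```python
import pandas as pd

# Hypothetical cleaned output standing in for the result of clean_data()
df_clean = pd.DataFrame({
    '保单号': ['P001', 'P002'],
    '签单/批改保费': [1200.0, -300.0],  # negative premium legitimately kept
    '三级机构': ['机构A', '机构B'],
})
initial_count, final_count = 2, 2  # would come from the pipeline

# 1. Record count change is reasonable (assumed threshold: < 5% dropped)
drop_ratio = (initial_count - final_count) / initial_count
assert drop_ratio < 0.05, f"Too many rows dropped: {drop_ratio:.1%}"

# 2. Critical fields are complete
for col in ['保单号', '签单/批改保费', '三级机构']:
    assert df_clean[col].notnull().all(), f"{col} still has nulls"

# 3. Negative premiums survived cleaning (they must)
print("Negative premiums kept:", int((df_clean['签单/批改保费'] < 0).sum()))
```

In a real run, replace the sample frame and counts with the pipeline's actual output before saving the cleaned file.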
Troubleshooting
"Many records deleted"
Check: Are you accidentally deleting negative premiums?
```python
# Check negative premium count
negative_count = (df['签单/批改保费'] < 0).sum()
print(f"Negative premiums: {negative_count} (should be kept)")
```
"Date conversion creates many NaT"
Solution: Check date format
```python
# Inspect date column
print(df['投保确认时间'].head())

# Try an explicit format
df['投保确认时间'] = pd.to_datetime(
    df['投保确认时间'],
    format='%Y/%m/%d',  # Adjust to match your data
    errors='coerce'
)
```
"Duplicates not removed"
Check: Ensure date column is datetime type
```python
print(df['投保确认时间'].dtype)  # Should be datetime64[ns]
```
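If that check shows `object` instead of `datetime64[ns]`, convert the column before deduplicating. A minimal sketch (the two-row frame is hypothetical; mixed string representations such as `'2025/01/01'` vs `'2025-01-01'` are the usual cause, though this sample uses a single format for simplicity):

```python
import pandas as pd

# Hypothetical sample: the date column arrived as strings (object dtype)
df = pd.DataFrame({
    '保单号': ['P001', 'P001'],
    '投保确认时间': ['2025-01-01', '2025-01-01'],
})
assert df['投保确认时间'].dtype == object

# Convert to datetime64[ns] first, then deduplicate on the composite key
df['投保确认时间'] = pd.to_datetime(df['投保确认时间'], errors='coerce')
assert str(df['投保确认时间'].dtype) == 'datetime64[ns]'

df = df[~df.duplicated(subset=['保单号', '投保确认时间'], keep='last')]
print(len(df))  # → 1
```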
Related Files
Data processor: `backend/data_processor.py`
- See `_clean_data()` method (lines 132-156)
- See `merge_with_existing()` method (lines 158-192)
Field definitions: docs/FIELD_MAPPING.md
Related Skills:
- `field-validation` - run this BEFORE cleaning
- `staff-mapping-management` - update the mapping table
Skill Version: v1.0
Created: 2025-11-09
File Size: ~310 lines
Focuses On: Data cleaning only (not validation or mapping)