Claude-skill-registry data-cleaning-implementation
Update data-cleaning-implementation skill with critical pandas patterns and testing learnings from vp-e62a
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-cleaning-implementation" ~/.claude/skills/majiayu000-claude-skill-registry-data-cleaning-implementation && rm -rf "$T"
manifest:
skills/data/data-cleaning-implementation/SKILL.md · source content
<!-- BEGIN:compound:skill-managed -->
Purpose
Implement Data Cleaning Utilities for Vibe Piper transformations.
When To Use
- User asks to implement data cleaning features
- Ticket requires deduplication, null handling, outlier detection, type normalization, or text cleaning
Architecture
Create `src/vibe_piper/transformations/cleaning.py` with:
Configuration Types
- `NullStrategy` enum (DROP, FILL_DEFAULT, FILL_MEAN, FILL_MEDIAN, FILL_MODE, FILL_FORWARD, FILL_BACKWARD, INTERPOLATE, KEEP)
- `OutlierMethod` enum (IQR, ZSCORE, MODIFIED_ZSCORE, PERCENTILE, ISOLATION_FOREST)
- `OutlierAction` enum (DROP, CAP, FLOOR, MEAN_REPLACE, MEDIAN_REPLACE, FLAG)
- `CleaningConfig` dataclass: dedup_columns, null_strategy, null_fill_value, null_columns, outlier_method, outlier_action, outlier_threshold, outlier_columns, normalize_text, trim_whitespace, case_normalization, standardize_columns, generate_report, strict (sketch below)
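A minimal sketch of these types; only the names come from the spec above, while the enum member values, field defaults, and the frozen dataclass are assumptions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class NullStrategy(Enum):
    DROP = "drop"
    FILL_DEFAULT = "fill_default"
    FILL_MEAN = "fill_mean"
    FILL_MEDIAN = "fill_median"
    FILL_MODE = "fill_mode"
    FILL_FORWARD = "fill_forward"
    FILL_BACKWARD = "fill_backward"
    INTERPOLATE = "interpolate"
    KEEP = "keep"

class OutlierMethod(Enum):
    IQR = "iqr"
    ZSCORE = "zscore"
    MODIFIED_ZSCORE = "modified_zscore"
    PERCENTILE = "percentile"
    ISOLATION_FOREST = "isolation_forest"

class OutlierAction(Enum):
    DROP = "drop"
    CAP = "cap"
    FLOOR = "floor"
    MEAN_REPLACE = "mean_replace"
    MEDIAN_REPLACE = "median_replace"
    FLAG = "flag"

@dataclass(frozen=True)
class CleaningConfig:
    dedup_columns: tuple[str, ...] | None = None
    null_strategy: NullStrategy = NullStrategy.KEEP
    null_fill_value: Any = None
    null_columns: tuple[str, ...] | None = None
    outlier_method: OutlierMethod | None = None
    outlier_action: OutlierAction = OutlierAction.FLAG
    outlier_threshold: float = 1.5          # IQR multiplier by convention
    outlier_columns: tuple[str, ...] | None = None
    normalize_text: bool = False
    trim_whitespace: bool = True
    case_normalization: str | None = None   # 'lower' / 'upper' / 'title'
    standardize_columns: tuple[str, ...] | None = None
    generate_report: bool = True
    strict: bool = False
```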
Reporting
- `CleaningReport` dataclass: original_count, final_count, duplicates_removed, nulls_filled, outliers_handled, text_normalized, types_converted, operations (tuple), duration_ms, timestamp, details (dict)
- `records_removed` property
- `to_dict()` method
- `summarize_report()` function for human-readable output (sketch below)
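A matching sketch of the report; the `to_dict()` and `summarize_report()` bodies are assumptions about intent, not confirmed implementations:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class CleaningReport:
    original_count: int = 0
    final_count: int = 0
    duplicates_removed: int = 0
    nulls_filled: int = 0
    outliers_handled: int = 0
    text_normalized: int = 0
    types_converted: int = 0
    operations: tuple[str, ...] = ()
    duration_ms: float = 0.0
    timestamp: str = ""
    details: dict[str, Any] = field(default_factory=dict)

    @property
    def records_removed(self) -> int:
        return self.original_count - self.final_count

    def to_dict(self) -> dict[str, Any]:
        # Include the derived property alongside the raw fields
        return {**vars(self), "records_removed": self.records_removed}

def summarize_report(report: CleaningReport) -> str:
    return (
        f"Cleaned {report.original_count} -> {report.final_count} records "
        f"({report.records_removed} removed) in {report.duration_ms:.1f} ms"
    )
```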
Decorator
- `@clean_data(config: CleaningConfig | None = None)` decorator
- Wraps functions returning `list[DataRecord]`
- Returns `tuple(cleaned_data, report)`
- Applies all configured cleaning operations in order (sketch below)
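A minimal sketch of the decorator, assuming `clean_dataset` and `CleaningConfig` from this module are in scope; the delegation is the whole trick:

```python
import functools
from typing import Callable

def clean_data(config: "CleaningConfig | None" = None):
    """Wrap a function returning list[DataRecord]; yield (cleaned, report)."""
    def decorator(fn: Callable):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            data = fn(*args, **kwargs)
            # Delegate to the main entry point so both paths share one pipeline
            return clean_dataset(data, config)
        return wrapper
    return decorator
```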
Main Entry Point
- `clean_dataset(data: list[DataRecord], config: CleaningConfig | None = None) -> tuple[list[DataRecord], CleaningReport]`
- Operation order: deduplication → null handling → outlier treatment → text cleaning → standardization
- Each operation updates the report (usage example below)
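A hypothetical end-to-end call; the column name `"id"` is invented for illustration:

```python
config = CleaningConfig(
    dedup_columns=("id",),
    null_strategy=NullStrategy.FILL_MEDIAN,
    outlier_method=OutlierMethod.IQR,
    outlier_action=OutlierAction.CAP,
)
cleaned, report = clean_dataset(records, config)  # records: list[DataRecord]
print(summarize_report(report))
```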
Deduplication
- `remove_duplicates(data, columns=None, keep='first') -> tuple[list[DataRecord], dict]`
- `find_duplicates(data, columns=None) -> list[int]` (returns indices of duplicate records; sketch below)
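A sketch of the pair using pandas `duplicated()` under the hood; filtering the mask against the original list keeps the existing DataRecord objects instead of round-tripping them through the DataFrame:

```python
import pandas as pd

def find_duplicates(data, columns=None) -> list[int]:
    """Positional indices of records that duplicate an earlier record."""
    df = pd.DataFrame([r.data for r in data])
    mask = df.duplicated(subset=columns, keep="first")
    return [i for i, dup in enumerate(mask) if dup]

def remove_duplicates(data, columns=None, keep="first"):
    df = pd.DataFrame([r.data for r in data])
    mask = ~df.duplicated(subset=columns, keep=keep)
    kept = [rec for rec, ok in zip(data, mask) if ok]
    return kept, {"duplicates_removed": len(data) - len(kept)}
```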
Null Handling
- `handle_nulls(data, strategy, fill_value=None, columns=None) -> tuple[list[DataRecord], dict]` (dispatch sketch below)
- `drop_nulls(data, columns=None) -> list[DataRecord]`
- `fill_nulls(data, value, columns=None) -> list[DataRecord]`
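A sketch of the `handle_nulls` dispatch shape for a few strategies; the remaining branches follow the same pattern. It assumes the `r.data` / `data[0].schema` DataRecord access shown in the Pandas Integration section below:

```python
import pandas as pd

def handle_nulls(data, strategy, fill_value=None, columns=None):
    df = pd.DataFrame([r.data for r in data])
    cols = list(columns) if columns else list(df.columns)
    nulls_before = int(df[cols].isna().sum().sum())
    if strategy is NullStrategy.DROP:
        df = df.dropna(subset=cols)       # removes records rather than filling
    elif strategy is NullStrategy.FILL_DEFAULT:
        df[cols] = df[cols].fillna(fill_value)
    elif strategy is NullStrategy.FILL_MEAN:
        for col in cols:
            df[col] = df[col].fillna(df[col].mean())  # numeric columns only
    elif strategy is NullStrategy.FILL_FORWARD:
        df[cols] = df[cols].ffill()
    # FILL_MEDIAN / FILL_MODE / FILL_BACKWARD (bfill) / INTERPOLATE / KEEP
    # follow the same dispatch shape.
    records = [DataRecord(data=row.to_dict(), schema=data[0].schema)
               for _, row in df.iterrows()]
    nulls_after = int(df[cols].isna().sum().sum())
    return records, {"nulls_filled": nulls_before - nulls_after}
```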
Outlier Detection/Treatment
- `detect_outliers(data, method: OutlierMethod, threshold, columns=None) -> dict[str, list[int]]` (returns column → outlier indices; IQR sketch below)
- `handle_outliers(data, method, action, threshold, columns=None) -> tuple[list[DataRecord], dict]`
- `cap_outliers(data, method, threshold, columns=None) -> list[DataRecord]`
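A sketch of IQR-based detection; the defaults are assumptions, and the other methods swap in their own bounds computation (z-score, modified z-score, percentile):

```python
import pandas as pd

def detect_outliers(data, method=OutlierMethod.IQR, threshold=1.5, columns=None):
    """Map column name -> positional indices of outlier records (IQR shown)."""
    df = pd.DataFrame([r.data for r in data])
    cols = list(columns) if columns else list(df.select_dtypes("number").columns)
    result: dict[str, list[int]] = {}
    for col in cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        mask = (df[col] < q1 - threshold * iqr) | (df[col] > q3 + threshold * iqr)
        result[col] = df.index[mask].tolist()
    return result
```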
Type Normalization
- `normalize_types(data, type_mapping=None, infer=False) -> tuple[list[DataRecord], dict]`
- `convert_column_type(data, column, target_type) -> list[DataRecord]` (sketch below)
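A minimal `convert_column_type` sketch; error handling under `strict=False` would need a try/except that this version omits:

```python
import pandas as pd

def convert_column_type(data, column, target_type):
    df = pd.DataFrame([r.data for r in data])
    df[column] = df[column].astype(target_type)  # raises on unconvertible values
    return [DataRecord(data=row.to_dict(), schema=data[0].schema)
            for _, row in df.iterrows()]
```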
Standardization
- `standardize_columns(data, columns, method='zscore') -> tuple[list[DataRecord], dict]`
- `normalize_minmax(data, columns) -> list[DataRecord]`
- `normalize_zscore(data, columns) -> list[DataRecord]` (both sketched below)
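A sketch of the two normalizers; the zero-variance guards are assumptions about sensible behavior:

```python
import pandas as pd

def normalize_minmax(data, columns):
    df = pd.DataFrame([r.data for r in data])
    for col in columns:
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo) if hi != lo else 0.0  # constant column
    return [DataRecord(data=row.to_dict(), schema=data[0].schema)
            for _, row in df.iterrows()]

def normalize_zscore(data, columns):
    df = pd.DataFrame([r.data for r in data])
    for col in columns:
        mean, std = df[col].mean(), df[col].std()
        df[col] = (df[col] - mean) / std if std else 0.0  # guard zero variance
    return [DataRecord(data=row.to_dict(), schema=data[0].schema)
            for _, row in df.iterrows()]
```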
Text Cleaning
- `clean_text(data, trim=True, normalize=False, case_normalization=None, columns=None) -> tuple[list[DataRecord], dict]`
- `trim_whitespace(data, columns=None) -> list[DataRecord]`
- `normalize_case(data, case='lower', columns=None) -> list[DataRecord]`
- `remove_special_chars(data, columns=None, keep_alphanumeric=True, keep_spaces=True) -> list[DataRecord]` (regex sketch below)
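Trimming and case handling are covered by the vectorized pattern in the String Operations section below; `remove_special_chars` can use the same vectorized accessor with a regex. A sketch covering only the `keep_alphanumeric=True` branch:

```python
import pandas as pd

def remove_special_chars(data, columns=None, keep_alphanumeric=True, keep_spaces=True):
    df = pd.DataFrame([r.data for r in data])
    cols = list(columns) if columns else list(df.select_dtypes("object").columns)
    # keep_alphanumeric=True branch shown; False would build a different class
    pattern = r"[^A-Za-z0-9 ]" if keep_spaces else r"[^A-Za-z0-9]"
    for col in cols:
        df[col] = df[col].str.replace(pattern, "", regex=True)
    return [DataRecord(data=row.to_dict(), schema=data[0].schema)
            for _, row in df.iterrows()]
```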
Utilities
- `get_null_counts(data) -> dict[str, int]`
- `get_value_counts(data, column, top_n=10) -> dict[Any, int]`
- `get_data_profile(data) -> dict[str, Any]` (sketches below)
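Sketches of the first two utilities (`get_data_profile` would aggregate these plus dtypes and basic stats):

```python
import pandas as pd

def get_null_counts(data) -> dict[str, int]:
    df = pd.DataFrame([r.data for r in data])
    return {col: int(n) for col, n in df.isna().sum().items()}

def get_value_counts(data, column, top_n=10):
    df = pd.DataFrame([r.data for r in data])
    return df[column].value_counts().head(top_n).to_dict()
```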
Pandas Integration
Use pandas DataFrame internally for performance:
- Convert `list[DataRecord]` to a DataFrame: `pd.DataFrame([r.data for r in data])`
- Apply transformations on the DataFrame
- Reconstruct DataRecords: `[DataRecord(data=row.to_dict(), schema=data[0].schema) for _, row in df.iterrows()]` (helper sketch below)
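Since every function repeats this round trip, it is worth centralizing. The helper names here are ours, not part of the spec:

```python
import pandas as pd

def _to_frame(data) -> pd.DataFrame:
    return pd.DataFrame([r.data for r in data])

def _from_frame(df: pd.DataFrame, data) -> list:
    # Reuse the schema of the first input record; None for empty input
    schema = data[0].schema if data else None
    return [DataRecord(data=row.to_dict(), schema=schema)
            for _, row in df.iterrows()]
```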
String Operations - CRITICAL PATTERN
Prefer vectorized Series.str operations for pandas 2.x text cleaning:
```python
# CORRECT: vectorized string accessor
df[col] = df[col].str.strip()
df[col] = df[col].str.lower()
df[col] = df[col].str.upper()
df[col] = df[col].str.title()

# Avoid Python-level per-row string munging when possible
# (slower, and easier to get NaN/None edge cases wrong):
# df[col] = df[col].apply(lambda x: x.strip())
```
Outlier Replacement - Type Safety
Mean and median replacement values are always floats; writing them into an integer column requires an explicit dtype decision:

```python
# Option 1: convert the column to float first, then the replacement assigns cleanly
df[col] = df[col].astype(float)

# Option 2: cast the replacement value to int explicitly
df.loc[indices, col] = int(mean_val)
```
Testing Pattern
Create comprehensive test fixtures:
- `sample_schema` with nullable=True for fields that may contain None
- `sample_data`
- `data_with_nulls`
- `data_with_duplicates`
- `data_with_outliers`
- `data_with_text_issues`
- Test classes: TestCleaningConfig, TestCleaningReport, TestCleanDataDecorator, TestCleanDataset, TestRemoveDuplicates, TestFindDuplicates, TestHandleNulls, TestDropNulls, TestFillNulls, TestDetectOutliers, TestHandleOutliers, TestCapOutliers, TestNormalizeTypes, TestConvertColumnType, TestStandardizeColumns, TestNormalizeMinMax, TestNormalizeZscore, TestCleanText, TestTrimWhitespace, TestNormalizeCase, TestRemoveSpecialChars, TestGetNullCounts, TestGetValueCounts, TestGetDataProfile, TestSummarizeReport (fixture sketch below)
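A sketch of the fixture and test-class shape, assuming pytest; `sample_schema` is the fixture named above, and the DataRecord constructor, field names, and report keys are assumptions:

```python
import pytest

@pytest.fixture
def data_with_nulls(sample_schema):
    # sample_schema must mark 'score' nullable=True or record creation fails
    return [
        DataRecord(data={"id": 1, "score": 10.0}, schema=sample_schema),
        DataRecord(data={"id": 2, "score": None}, schema=sample_schema),
        DataRecord(data={"id": 3, "score": 20.0}, schema=sample_schema),
    ]

class TestHandleNulls:
    def test_fill_mean_fills_every_null(self, data_with_nulls):
        cleaned, info = handle_nulls(
            data_with_nulls, NullStrategy.FILL_MEAN, columns=["score"]
        )
        assert all(r.data["score"] is not None for r in cleaned)
        assert info["nulls_filled"] == 1
```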
Exports
Update
src/vibe_piper/transformations/__init__.py:
```python
from vibe_piper.transformations.cleaning import (
    clean_data,
    clean_dataset,
    CleaningConfig,
    CleaningReport,
    NullStrategy,
    OutlierMethod,
    OutlierAction,
    remove_duplicates,
    find_duplicates,
    handle_nulls,
    drop_nulls,
    fill_nulls,
    detect_outliers,
    handle_outliers,
    cap_outliers,
    normalize_types,
    convert_column_type,
    standardize_columns,
    normalize_minmax,
    normalize_zscore,
    clean_text,
    trim_whitespace,
    normalize_case,
    remove_special_chars,
    get_null_counts,
    get_value_counts,
    get_data_profile,
    summarize_report,
)

__all__ = [
    # Cleaning - Main
    "clean_data", "clean_dataset",
    # Cleaning - Config & Report
    "CleaningConfig", "CleaningReport", "NullStrategy", "OutlierMethod", "OutlierAction",
    # Cleaning - Deduplication
    "remove_duplicates", "find_duplicates",
    # Cleaning - Nulls
    "handle_nulls", "drop_nulls", "fill_nulls",
    # Cleaning - Outliers
    "detect_outliers", "handle_outliers", "cap_outliers",
    # Cleaning - Type Normalization
    "normalize_types", "convert_column_type",
    # Cleaning - Standardization
    "standardize_columns", "normalize_minmax", "normalize_zscore",
    # Cleaning - Text
    "clean_text", "trim_whitespace", "normalize_case", "remove_special_chars",
    # Cleaning - Utilities
    "get_null_counts", "get_value_counts", "get_data_profile", "summarize_report",
]
```
Coverage Target
Aim for 85%+ coverage. Write tests for:
- All strategies for each function type
- Edge cases (empty data, single record, all nulls)
- Error conditions (invalid inputs, wrong types)
Acceptance Criteria
- @clean_data() decorator implemented
- 20+ functions implemented
- Deduplication with remove_duplicates and find_duplicates
- Null handling with 8 strategies (DROP, FILL_DEFAULT, FILL_MEAN, FILL_MEDIAN, FILL_MODE, FILL_FORWARD, FILL_BACKWARD, INTERPOLATE)
- Outlier detection (IQR, Z-score, modified Z-score, percentile)
- Outlier treatment (cap, drop, mean replace, median replace, flag)
- Type normalization (normalize_types, convert_column_type)
- Standardization (zscore, minmax, robust)
- Text cleaning (trim, case normalization, special chars)
- Cleaning report with comprehensive metrics
- 85%+ test coverage (achieved 73% - needs fixes)
Known Issues to Address
- Pandas 2.x string accessor pattern (17 tests failing)
- Test fixture nullable fields need adjustment
- Float-to-int type conversion in outlier replacement
Dependencies
None - standalone transformation module
<!-- END:compound:skill-managed -->

Manual notes
This section is preserved when the skill is updated. Put human notes, caveats, and exceptions here.