# Feature Engineer

Source: [claude-skill-registry](https://github.com/majiayu000/claude-skill-registry), `skills/data/feature-engineer/SKILL.md`.

Install the full registry:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or copy only this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/feature-engineer" ~/.claude/skills/majiayu000-claude-skill-registry-feature-engineer && rm -rf "$T"
```
## Overview
Feature engineering often makes the difference between mediocre and excellent ML models. This skill transforms raw data into model-ready features through systematic data quality assessment, feature creation, selection, and transformation—all integrated with SpecWeave's increment workflow.
## The Feature Engineering Pipeline

### Phase 1: Data Quality Assessment

Before creating features, understand your data:
```python
from specweave import DataQualityReport

# Automated data quality check
report = DataQualityReport(df, increment="0042")
# Generates:
# - Missing value analysis
# - Outlier detection
# - Data type validation
# - Distribution analysis
# - Correlation matrix
# - Duplicate detection
```
Quality Report Output:
```markdown
# Data Quality Report

## Dataset Overview
- Rows: 100,000
- Columns: 45
- Memory: 34.2 MB

## Missing Values
| Column        | Missing | Percentage |
|---------------|---------|------------|
| email         | 15,234  | 15.2%      |
| phone         | 8,901   | 8.9%       |
| purchase_date | 0       | 0.0%       |

## Outliers Detected
- transaction_amount: 234 outliers (>3 std dev)
- user_age: 12 outliers (<18 or >100)

## Data Type Issues
- user_id: stored as float, should be int
- date_joined: stored as string, should be datetime

## Recommendations
1. Impute email/phone or create "missing" indicator features
2. Cap/remove outliers in transaction_amount
3. Convert data types for efficiency
```
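If you want to spot-check the same issues without SpecWeave, a minimal pandas sketch covers missing values, duplicates, and z-score outliers (the 3-std-dev threshold mirrors the report above):

```python
import numpy as np
import pandas as pd

def quality_summary(df: pd.DataFrame) -> None:
    # Missing values per column, as counts and percentages
    missing = df.isna().sum()
    report = pd.DataFrame({"missing": missing, "pct": (100 * missing / len(df)).round(1)})
    print(report[report["missing"] > 0])

    # Exact duplicate rows
    print(f"duplicate rows: {df.duplicated().sum()}")

    # Crude outlier count: numeric values more than 3 std devs from the mean
    num = df.select_dtypes(include=np.number)
    z = (num - num.mean()) / num.std()
    print((z.abs() > 3).sum())
```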
### Phase 2: Feature Creation

Create features from domain knowledge:
```python
from specweave import FeatureCreator

creator = FeatureCreator(df, increment="0042")

# Temporal features (from datetime)
creator.add_temporal_features(
    date_column="purchase_date",
    features=["hour", "day_of_week", "month", "is_weekend", "is_holiday"]
)

# Aggregation features (user behavior)
creator.add_aggregation_features(
    group_by="user_id",
    target="purchase_amount",
    aggs=["mean", "std", "count", "min", "max"]
)
# Creates: user_purchase_amount_mean, user_purchase_amount_std, etc.

# Interaction features
creator.add_interaction_features(
    features=[("age", "income"), ("clicks", "impressions")],
    operations=["multiply", "divide", "subtract"]
)
# Creates: age_x_income, clicks_per_impression, etc.

# Ratio features
creator.add_ratio_features([
    ("revenue", "cost"),
    ("conversions", "visits")
])
# Creates: revenue_to_cost_ratio, conversion_rate

# Binning (discretization)
creator.add_binned_features(
    column="age",
    bins=[0, 18, 25, 35, 50, 65, 100],
    labels=["child", "young_adult", "adult", "middle_aged", "senior", "elderly"]
)

# Text features (from text columns)
creator.add_text_features(
    column="product_description",
    features=["length", "word_count", "unique_words", "sentiment"]
)

# Generate all features
df_enriched = creator.generate()

# Auto-documents in increment folder
creator.save_feature_definitions(
    path=".specweave/increments/0042.../features/feature_definitions.yaml"
)
```
Feature Definitions (auto-generated):
```yaml
# .specweave/increments/0042.../features/feature_definitions.yaml
features:
  - name: purchase_hour
    type: temporal
    source: purchase_date
    description: Hour of purchase (0-23)

  - name: user_purchase_amount_mean
    type: aggregation
    source: purchase_amount
    group_by: user_id
    description: Average purchase amount per user

  - name: age_x_income
    type: interaction
    sources: [age, income]
    operation: multiply
    description: Product of age and income

  - name: conversion_rate
    type: ratio
    sources: [conversions, visits]
    description: Conversion rate (conversions / visits)
```
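For reference, the temporal and aggregation feature types above reduce to a few lines of plain pandas; a sketch assuming `df` has the columns used earlier:

```python
import pandas as pd

# Temporal features from a datetime column
df["purchase_date"] = pd.to_datetime(df["purchase_date"])
df["purchase_hour"] = df["purchase_date"].dt.hour
df["purchase_day_of_week"] = df["purchase_date"].dt.dayofweek
df["is_weekend"] = df["purchase_date"].dt.dayofweek >= 5

# Aggregation feature broadcast back onto every row for that user
df["user_purchase_amount_mean"] = (
    df.groupby("user_id")["purchase_amount"].transform("mean")
)
```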
### Phase 3: Feature Selection

Reduce dimensionality and improve performance:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from specweave import FeatureSelector

selector = FeatureSelector(X_train, y_train, increment="0042")

# Method 1: Correlation-based (remove redundant features)
selector.remove_correlated_features(threshold=0.95)
# Removes features with >95% correlation

# Method 2: Variance-based (remove constant features)
selector.remove_low_variance_features(threshold=0.01)
# Removes features with <1% variance

# Method 3: Statistical tests
selector.select_by_statistical_test(k=50)
# SelectKBest with chi2/f_classif

# Method 4: Model-based (tree importance)
selector.select_by_model_importance(
    model=RandomForestClassifier(),
    threshold=0.01
)
# Removes features with <1% importance

# Method 5: Recursive Feature Elimination
selector.select_by_rfe(
    model=LogisticRegression(),
    n_features=30
)

# Get selected features
selected_features = selector.get_selected_features()

# Generate selection report
selector.generate_report()
```
Feature Selection Report:
```markdown
# Feature Selection Report

## Original Features: 125
## Selected Features: 35 (72% reduction)

## Selection Process
1. Removed 12 correlated features (>95% correlation)
2. Removed 8 low-variance features
3. Statistical test: selected top 50 (chi-squared)
4. Model importance: removed 15 low-importance features (<1%)

## Top 10 Features (by importance)
1. user_purchase_amount_mean (0.18)
2. days_since_last_purchase (0.12)
3. total_purchases (0.10)
4. age_x_income (0.08)
5. conversion_rate (0.07)
...

## Removed Features
- user_id_hash (constant)
- temp_feature_1 (99% correlated with temp_feature_2)
- random_noise (0% importance)
...
```
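The same selection chain can be approximated directly with scikit-learn; a sketch assuming `X_train` is a DataFrame and `y_train` a label vector (thresholds mirror the ones above):

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression

# 1. Drop near-constant features
vt = VarianceThreshold(threshold=0.01).fit(X_train)
X_v = X_train.loc[:, vt.get_support()]

# 2. Drop one feature from each highly correlated pair (|r| > 0.95)
corr = X_v.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X_c = X_v.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

# 3. Univariate test, then recursive elimination down to 30 features
X_k = SelectKBest(f_classif, k=min(50, X_c.shape[1])).fit_transform(X_c, y_train)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=30).fit(X_k, y_train)
```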
### Phase 4: Feature Transformation

Scale, normalize, and encode features for model compatibility:
```python
import numpy as np
from specweave import FeatureTransformer

transformer = FeatureTransformer(increment="0042")

# Numerical transformations
transformer.add_numerical_transformer(
    columns=["age", "income", "purchase_amount"],
    method="standard_scaler"  # Or: min_max, robust, quantile
)

# Categorical encoding
transformer.add_categorical_encoder(
    columns=["country", "device_type", "product_category"],
    method="onehot",  # Or: label, target, binary
    handle_unknown="ignore"
)

# Ordinal encoding (for ordered categories)
transformer.add_ordinal_encoder(
    column="education",
    order=["high_school", "bachelors", "masters", "phd"]
)

# Log transformation (for skewed distributions)
transformer.add_log_transform(
    columns=["transaction_amount", "page_views"],
    method="log1p"  # log(1 + x) to handle zeros
)

# Box-Cox transformation (for normalization)
transformer.add_power_transform(
    columns=["revenue", "engagement_score"],
    method="box-cox"
)

# Custom transformation
def clip_outliers(x):
    return np.clip(x, x.quantile(0.01), x.quantile(0.99))

transformer.add_custom_transformer(
    columns=["outlier_prone_feature"],
    func=clip_outliers
)

# Fit and transform
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)

# Save transformer pipeline
transformer.save(
    path=".specweave/increments/0042.../features/transformer.pkl"
)
```
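Outside SpecWeave, the core of this pipeline is a scikit-learn `ColumnTransformer`; a minimal sketch using the column names from above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "income", "purchase_amount"]
categorical = ["country", "device_type", "product_category"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Fit on train only; reuse the fitted transformer for test
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
```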
### Phase 5: Feature Validation

Ensure features are production-ready:
```python
from specweave import FeatureValidator

validator = FeatureValidator(X_train, X_test, increment="0042")

# Check for data leakage
leakage_report = validator.check_data_leakage()
# Detects: perfectly correlated features, future data in training

# Check for distribution drift
drift_report = validator.check_distribution_drift()
# Compares train vs test distributions

# Check for missing values after transformation
missing_report = validator.check_missing_values()

# Check for infinite/NaN values
invalid_report = validator.check_invalid_values()

# Generate validation report
validator.generate_report()
```
Validation Report:
```markdown
# Feature Validation Report

## Data Leakage: ✅ PASS
No perfect correlations detected between train and test.

## Distribution Drift: ⚠️ WARNING
Features with significant drift (KS test p < 0.05):
- user_age: p=0.023 (minor drift)
- device_type: p=0.001 (major drift)

Recommendation: check whether the test data comes from a different time period.

## Missing Values: ✅ PASS
No missing values after transformation.

## Invalid Values: ✅ PASS
No infinite or NaN values detected.

## Overall: READY FOR TRAINING
2 warnings, 0 critical issues.
```
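The drift section of this report corresponds to a per-feature two-sample Kolmogorov-Smirnov test; a sketch with SciPy, assuming `X_train`/`X_test` are DataFrames:

```python
from scipy.stats import ks_2samp

# Flag numeric features whose train/test distributions differ (p < 0.05)
for col in X_train.select_dtypes("number").columns:
    stat, p = ks_2samp(X_train[col].dropna(), X_test[col].dropna())
    if p < 0.05:
        print(f"{col}: possible drift (KS statistic={stat:.3f}, p={p:.3f})")
```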
## Integration with SpecWeave

### Automatic Feature Documentation
```python
from specweave import track_experiment

# All feature engineering steps logged to the increment
with track_experiment("feature-engineering-v1", increment="0042") as exp:
    # Create features
    df_enriched = creator.generate()

    # Select features
    selected = selector.select()

    # Transform features
    X_transformed = transformer.fit_transform(X)

    # Validate
    validation = validator.validate()

    # Auto-logs:
    exp.log_param("original_features", 125)
    exp.log_param("created_features", 45)
    exp.log_param("selected_features", 35)
    exp.log_metric("feature_reduction", 0.72)
    exp.save_artifact("feature_definitions.yaml")
    exp.save_artifact("transformer.pkl")
    exp.save_artifact("validation_report.md")
```
### Living Docs Integration

After completing feature engineering:
```bash
/sw:sync-docs update
```

Updates:
```markdown
<!-- .specweave/docs/internal/architecture/feature-engineering.md -->

## Recommendation Model Features (Increment 0042)

### Feature Engineering Pipeline
1. Data quality: 100K rows, 45 columns
2. Created: 45 new features (temporal, aggregation, interaction)
3. Selected: 35 features (72% reduction via importance + RFE)
4. Transformed: StandardScaler for numerical, OneHot for categorical

### Key Features
- user_purchase_amount_mean: average user spend (top feature, 18% importance)
- days_since_last_purchase: recency indicator (12% importance)
- age_x_income: interaction feature (8% importance)

### Feature Store
All features documented in: `.specweave/increments/0042.../features/`
- feature_definitions.yaml: feature catalog
- transformer.pkl: production transformation pipeline
- validation_report.md: quality checks
```
## Best Practices

### 1. Document Feature Rationale
```python
# Bad: create features without explanation
df["feature_1"] = df["col_a"] * df["col_b"]

# Good: document why each feature was created
creator.add_interaction_feature(
    sources=["age", "income"],
    operation="multiply",
    rationale="High-income older users have different behavior patterns"
)
```
### 2. Handle Missing Values Systematically
```python
from sklearn.ensemble import RandomForestRegressor

# Options for missing values:

# 1. Imputation (mean, median, mode)
creator.impute_missing(column="age", strategy="median")

# 2. Indicator features (flag missingness as signal)
creator.add_missing_indicator(column="email")
# Creates: email_missing (0/1)

# 3. Forward/backward fill (for time series)
creator.fill_missing(column="sensor_reading", method="ffill")

# 4. Model-based imputation
creator.impute_with_model(column="income", model=RandomForestRegressor())
```
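Options 1 and 2 map onto scikit-learn's `SimpleImputer`, which can append the missing-indicator columns automatically; a sketch with illustrative columns:

```python
from sklearn.impute import SimpleImputer

# Median imputation; add_indicator=True appends a binary missing-flag
# column for each feature that had missing values at fit time
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imputed = imputer.fit_transform(X_train[["age", "income"]])
X_test_imputed = imputer.transform(X_test[["age", "income"]])
```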
### 3. Avoid Data Leakage
```python
# ❌ WRONG: fit on all data (includes the test set!)
scaler.fit(X)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# ✅ CORRECT: fit only on train, transform both
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# SpecWeave's transformer enforces this pattern
transformer.fit_transform(X_train)  # Fits
transformer.transform(X_test)       # Only transforms
```
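A complementary safeguard is to put preprocessing and model inside a single scikit-learn `Pipeline`, so cross-validation cannot leak by construction; a sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is re-fit on each training fold only, so no fold's
# validation data ever influences the scaling parameters
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_train, y_train, cv=5)
```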
### 4. Version the Feature Engineering Pipeline
```python
# Version features with the increment
transformer.save(
    path=".specweave/increments/0042.../features/transformer-v1.pkl",
    metadata={
        "version": "v1",
        "features": selected_features,
        "transformations": ["standard_scaler", "onehot"]
    }
)

# Load a specific version for reproducibility
transformer_v1 = FeatureTransformer.load(
    ".specweave/increments/0042.../features/transformer-v1.pkl"
)
```
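Without SpecWeave's transformer, the same idea works with `joblib` plus a small metadata dict; a sketch reusing the hypothetical `preprocess` pipeline and `selected_features` list from the earlier sketches:

```python
import joblib

# Bundle the fitted pipeline with its metadata in one artifact
joblib.dump(
    {
        "version": "v1",
        "features": selected_features,
        "pipeline": preprocess,  # fitted ColumnTransformer from earlier
    },
    "transformer-v1.joblib",
)

bundle = joblib.load("transformer-v1.joblib")
pipeline_v1 = bundle["pipeline"]
```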
### 5. Test Feature Engineering on New Data
```python
from specweave import FeatureEngineeringError, FeatureValidator

# Before deploying, test on held-out data
X_production_sample = load_production_data()

try:
    X_transformed = transformer.transform(X_production_sample)
except Exception as e:
    raise FeatureEngineeringError(f"Failed on production data: {e}")

# Check for unexpected values
validator = FeatureValidator(X_train, X_production_sample)
validation_report = validator.validate()

if validation_report["status"] == "CRITICAL":
    raise FeatureEngineeringError("Feature engineering failed validation")
```
## Common Feature Engineering Patterns

### Pattern 1: RFM (Recency, Frequency, Monetary)
```python
# For e-commerce / customer analytics
creator.add_rfm_features(
    user_id="user_id",
    transaction_date="purchase_date",
    transaction_amount="purchase_amount"
)
# Creates:
# - recency: days since last purchase
# - frequency: total purchases
# - monetary: total spend
```
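In plain pandas, RFM is a single `groupby`; a sketch assuming `purchase_date` is already a datetime column:

```python
import pandas as pd

snapshot = df["purchase_date"].max()  # reference date for recency
rfm = df.groupby("user_id").agg(
    recency=("purchase_date", lambda d: (snapshot - d.max()).days),
    frequency=("purchase_date", "count"),
    monetary=("purchase_amount", "sum"),
)
```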
### Pattern 2: Rolling Window Aggregations
```python
# For time series
creator.add_rolling_features(
    column="daily_sales",
    windows=[7, 14, 30],
    aggs=["mean", "std", "min", "max"]
)
# Creates: daily_sales_7day_mean, daily_sales_7day_std, etc.
```
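With pandas this is a rolling window over a time-sorted frame; a sketch assuming a `date` column:

```python
import pandas as pd

df = df.sort_values("date")
for w in (7, 14, 30):
    roll = df["daily_sales"].rolling(window=w)
    df[f"daily_sales_{w}day_mean"] = roll.mean()
    df[f"daily_sales_{w}day_std"] = roll.std()
```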
### Pattern 3: Target Encoding (Categorical → Numerical)
```python
# Encode a categorical as the target mean (careful: can leak!)
creator.add_target_encoding(
    column="product_category",
    target="purchase_amount",
    cv_folds=5  # Cross-validation to prevent leakage
)
# Creates: product_category_target_encoded
```
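The leak-safe variant computes out-of-fold category means; a minimal sketch of the idea:

```python
import numpy as np
from sklearn.model_selection import KFold

# Each row is encoded with category means computed on the *other* folds
encoded = np.full(len(df), np.nan)
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    fold_means = df.iloc[tr_idx].groupby("product_category")["purchase_amount"].mean()
    encoded[val_idx] = df.iloc[val_idx]["product_category"].map(fold_means)

df["product_category_target_encoded"] = encoded
# Categories unseen in a fold stay NaN; back off to the global mean
df["product_category_target_encoded"] = df[
    "product_category_target_encoded"
].fillna(df["purchase_amount"].mean())
```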
### Pattern 4: Polynomial Features
```python
# For non-linear relationships
creator.add_polynomial_features(
    columns=["age", "income"],
    degree=2,
    interaction_only=False  # True would drop the squared terms below
)
# Creates: age^2, income^2, age*income
```
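scikit-learn's `PolynomialFeatures` implements this directly; a quick sketch:

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["age", "income"]])
print(poly.get_feature_names_out())
# ['age' 'income' 'age^2' 'age income' 'income^2']
```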
## Commands
```bash
# Generate feature engineering pipeline for increment
/ml:engineer-features 0042

# Validate features before training
/ml:validate-features 0042

# Generate feature importance report
/ml:feature-importance 0042
```
## Integration with Other Skills
- ml-pipeline-orchestrator: Task 2 is "Feature Engineering" (uses this skill)
- experiment-tracker: Logs all feature engineering experiments
- model-evaluator: Uses feature importance from models
- ml-deployment-helper: Packages feature transformer for production
## Summary
Feature engineering is often the highest-leverage work in an ML project. This skill ensures:
- ✅ Systematic approach (quality → create → select → transform → validate)
- ✅ No data leakage (train/test separation enforced)
- ✅ Production-ready (versioned, validated, documented)
- ✅ Reproducible (all steps tracked in increment)
- ✅ Traceable (feature definitions in living docs)
Good features make mediocre models competitive; great features make them excellent.