Skillsbench feature_engineering
Engineer dataset features before ML or Causal Inference. Methods include encoding categorical variables, scaling numerics, creating interactions, and selecting relevant features.
install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/trend-anomaly-causal-inference/environment/skills/feature_engineering" ~/.claude/skills/benchflow-ai-skillsbench-feature-engineering && rm -rf "$T"
manifest:
tasks/trend-anomaly-causal-inference/environment/skills/feature_engineering/SKILL.md
Feature Engineering Framework
Comprehensive, modular feature engineering framework for general tabular datasets. Provides strategy-based operations including numerical scaling, categorical encoding, polynomial features, and feature selection through a configurable pipeline.
Core Components
FeatureEngineeringStrategies
Collection of static methods for feature engineering operations:
Numerical Features (if interpretability is not a concern)
- scale_numerical(df, columns, method): Scale using 'standard', 'minmax', or 'robust'
- create_bins(df, columns, n_bins, strategy): Discretize using 'uniform', 'quantile', or 'kmeans'
- create_polynomial_features(df, columns, degree): Generate polynomial and interaction terms
- create_interaction_features(df, column_pairs): Create multiplicative interactions
- create_log_features(df, columns): Log-transform skewed distributions
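As a rough sketch of what the 'standard' scaling and log strategies compute, using plain pandas/NumPy (the framework's internals may differ; the column name is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30_000.0, 45_000.0, 120_000.0, 60_000.0]})

# 'standard' scaling: subtract the mean, divide by the (population) std
mu, sigma = df["income"].mean(), df["income"].std(ddof=0)
df["income_scaled"] = (df["income"] - mu) / sigma

# log transform for right-skewed values; log1p is safe at zero
df["income_log"] = np.log1p(df["income"])
```

The scaled column has zero mean and unit variance, which keeps downstream models from being dominated by the raw income magnitudes.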
Categorical Features
- encode_categorical(df, columns, method): Encode using 'onehot', 'label', 'frequency', or 'hash'
- create_category_aggregations(df, categorical_col, numerical_cols, agg_funcs): Compute group-level statistics
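A minimal pandas sketch of 'frequency' encoding and a group-level aggregation, with toy column names (illustrative, not the framework's actual implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "A", "A", "C"],
    "spend": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# 'frequency' encoding: replace each category with its relative frequency
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# group-level aggregation: attach the per-category mean of a numeric column
df["city_spend_mean"] = df.groupby("city")["spend"].transform("mean")
```

Both derived columns are numeric, so they can feed directly into modeling without further encoding.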
Binary Features
- convert_to_binary(df, columns): Convert Yes/No and True/False values to 0/1 (int dtype)
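A plain-pandas sketch of the binary conversion, assuming a small mapping of common Yes/No and True/False values (the framework may recognize more variants):

```python
import pandas as pd

df = pd.DataFrame({"smoker": ["Yes", "No", "Yes"], "active": [True, False, True]})

# map common binary encodings to 0/1 and force an integer dtype
mapping = {"Yes": 1, "No": 0, True: 1, False: 0}
for col in ["smoker", "active"]:
    df[col] = df[col].map(mapping).astype(int)
```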
Data Quality Validation
- validate_numeric_features(df, exclude_cols): Verify all features are numeric (except ID columns)
- validate_no_constants(df, exclude_cols): Remove constant columns with no variance
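The two validation checks can be sketched with plain pandas (illustrative only; column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "age": [25, 40, 31],
    "flag": [1, 1, 1],          # constant column: no variance
})

exclude = ["id"]
feature_cols = [c for c in df.columns if c not in exclude]

# fail fast if any feature column is still non-numeric
non_numeric = df[feature_cols].select_dtypes(exclude="number").columns.tolist()
assert not non_numeric, f"non-numeric features remain: {non_numeric}"

# drop columns with a single unique value
constants = [c for c in feature_cols if df[c].nunique() <= 1]
df = df.drop(columns=constants)
```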
Feature Selection
- select_features_variance(df, columns, threshold): Remove low-variance features (default threshold: 0.01). Columns whose values are almost all identical carry little information, so dropping them reduces dimensionality.
- select_features_correlation(df, columns, threshold): Remove highly correlated features
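An illustrative pandas version of both selection steps, assuming a 0.01 variance cutoff and a 0.95 correlation threshold (the framework's tie-breaking rules may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with x1
    "x3": [0.0, 0.0, 0.0, 0.01],   # near-constant
})

# variance threshold: drop features whose variance falls below the cutoff
low_var = [c for c in df.columns if df[c].var() < 0.01]

# correlation filter: for each highly correlated pair, drop the later column
corr = df.drop(columns=low_var).corr().abs()
drop_corr = [
    c for i, c in enumerate(corr.columns)
    if any(corr.iloc[j, i] > 0.95 for j in range(i))
]
selected = [c for c in df.columns if c not in low_var + drop_corr]
```

Here x3 falls to the variance filter and x2 to the correlation filter, leaving only x1.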
FeatureEngineeringPipeline
Orchestrates multiple feature engineering steps with logging.
CRITICAL REQUIREMENTS:
- ALL output features MUST be numeric (int or float); DID analysis cannot use string/object columns
- Preview data types BEFORE processing: use df.dtypes and df.head() to check actual values
- Encode ALL categorical variables: strings like "degree" and "age_range" must be converted to numbers
- Verify output: the final dataframe should satisfy df.select_dtypes(include='number').shape[1] == df.shape[1] - 1 (excluding the ID column)
Usage Example
```python
from feature_engineering import FeatureEngineeringStrategies, FeatureEngineeringPipeline

# Create pipeline
pipeline = FeatureEngineeringPipeline(name="Demographics")

# Add feature engineering steps
pipeline.add_step(
    FeatureEngineeringStrategies.convert_to_binary,
    columns=['<column5>', '<column2>'],
    description="Convert binary survey responses to 0/1"
).add_step(
    FeatureEngineeringStrategies.encode_categorical,
    columns=['<column3>', '<column7>'],
    method='onehot',
    description="One-hot encode categorical features"
).add_step(
    FeatureEngineeringStrategies.scale_numerical,
    columns=['<column10>', '<column1>'],
    method='standard',
    description="Standardize numerical features"
).add_step(
    FeatureEngineeringStrategies.validate_numeric_features,
    exclude_cols=['<ID Column>'],
    description="Verify all features are numeric before modeling"
).add_step(
    FeatureEngineeringStrategies.validate_no_constants,
    exclude_cols=['<ID Column>'],
    description="Remove constant columns with no predictive value"
).add_step(
    FeatureEngineeringStrategies.select_features_variance,
    columns=[],  # Empty = auto-select all numerical
    threshold=0.01,
    description="Remove low-variance features"
)

# Execute pipeline
# df_complete contains the original columns plus the engineered features
df_complete = pipeline.execute(your_cleaned_df, verbose=True)

# Shortcut: keep only the ID column and the engineered features
engineered_features = pipeline.get_engineered_features()
df_id_pure_features = df_complete[['<ID Column>'] + engineered_features]

# Get execution log
log_df = pipeline.get_log()
```
Input
- A valid DataFrame, passed to feature engineering only after any required data processing, imputation, or column drops (this preprocessing is a MUST)
Output
- DataFrame with both original and engineered columns
- Engineered feature names accessible via pipeline.get_engineered_features()
- Execution log available via pipeline.get_log()
Key Features
- Multiple encoding methods for categorical variables
- Automatic handling of high-cardinality categoricals
- Polynomial and interaction feature generation
- Built-in feature selection for dimensionality reduction
- Pipeline pattern for reproducible transformations
Best Practices
- Always validate data types before downstream analysis: use validate_numeric_features() after encoding
- Check for constant columns that provide no information: use validate_no_constants() before modeling
- Convert binary features before other transformations
- Use one-hot encoding for low-cardinality categoricals
- Use KNN imputation when missing values can be inferred from other relevant columns
- Use hash encoding for high-cardinality features (IDs, etc.)
- Apply variance threshold to remove constant features
- Check correlation matrix before modeling to avoid multicollinearity
- MAKE SURE ALL ENGINEERED FEATURES ARE NUMERICAL
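For the hash-encoding advice above, a deterministic sketch using hashlib (Python's built-in hash() is salted per process, so a stable digest is preferable; the bucket count of 16 is an arbitrary choice):

```python
import hashlib
import pandas as pd

def hash_encode(series: pd.Series, n_buckets: int = 16) -> pd.Series:
    """Map high-cardinality strings to a fixed number of integer buckets."""
    return series.map(
        lambda v: int(hashlib.md5(str(v).encode()).hexdigest(), 16) % n_buckets
    )

df = pd.DataFrame({"device_id": ["a1", "b2", "c3", "a1"]})
df["device_bucket"] = hash_encode(df["device_id"])
```

Identical raw values always land in the same bucket, and the output stays numeric regardless of how many distinct IDs appear.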