Skillsbench feature_engineering

Engineer dataset features before ML or Causal Inference. Methods include encoding categorical variables, scaling numerics, creating interactions, and selecting relevant features.

install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/trend-anomaly-causal-inference/environment/skills/feature_engineering" ~/.claude/skills/benchflow-ai-skillsbench-feature-engineering && rm -rf "$T"
manifest: tasks/trend-anomaly-causal-inference/environment/skills/feature_engineering/SKILL.md
source content

Feature Engineering Framework

Comprehensive, modular feature engineering framework general tabular datasets. Provides strategy-based operations including numerical scaling, categorical encoding, polynomial features, and feature selection through a configurable pipeline.

Core Components

FeatureEngineeringStrategies

Collection of static methods for feature engineering operations:

Numerical Features (if intepretability is not a concern)

  • scale_numerical(df, columns, method)
    - Scale using 'standard', 'minmax', or 'robust'
  • create_bins(df, columns, n_bins, strategy)
    - Discretize using 'uniform', 'quantile', or 'kmeans'
  • create_polynomial_features(df, columns, degree)
    - Generate polynomial and interaction terms
  • create_interaction_features(df, column_pairs)
    - Create multiplication interactions
  • create_log_features(df, columns)
    - Log-transform for skewed distributions

Categorical Features

  • encode_categorical(df, columns, method)
    - Encode using 'onehot', 'label', 'frequency', or 'hash'
  • create_category_aggregations(df, categorical_col, numerical_cols, agg_funcs)
    - Group-level statistics

Binary Features

  • convert_to_binary(df, columns)
    - Convert Yes/No, True/False to 0/1 (data type to int)

Data Quality Validation

  • validate_numeric_features(df, exclude_cols)
    - Verify all features are numeric (except ID columns)
  • validate_no_constants(df, exclude_cols)
    - Remove constant columns with no variance

Feature Selection

  • select_features_variance(df, columns, threshold)
    - Remove low-variance features (default: 0.01). For some columns that consist of almost the same values, we might consider to drop due to the low variance it brings in order to reduce dimensionality.
  • select_features_correlation(df, columns, threshold)
    - Remove highly correlated features

FeatureEngineeringPipeline

Orchestrates multiple feature engineering steps with logging.

CRITICAL REQUIREMENTS:

  1. ALL output features MUST be numeric (int or float) - DID analysis cannot use string/object columns
  2. Preview data types BEFORE processing:
    df.dtypes
    and
    df.head()
    to check actual values
  3. Encode ALL categorical variables - strings like "degree", "age_range" must be converted to numbers
  4. Verify output: Final dataframe should have
    df.select_dtypes(include='number').shape[1] == df.shape[1] - 1
    (excluding ID column)

Usage Example

from feature_engineering import FeatureEngineeringStrategies, FeatureEngineeringPipeline

# Create pipeline
pipeline = FeatureEngineeringPipeline(name="Demographics")

# Add feature engineering steps
pipeline.add_step(
    FeatureEngineeringStrategies.convert_to_binary,
    columns=['<column5>', '<column2>'],
    description="Convert binary survey responses to 0/1"
).add_step(
    FeatureEngineeringStrategies.encode_categorical,
    columns=['<column3>', '<column7>'],
    method='onehot',
    description="One-hot encode categorical features"
).add_step(
    FeatureEngineeringStrategies.scale_numerical,
    columns=['<column10>', '<column1>'],
    method='standard',
    description="Standardize numerical features"
).add_step(
    FeatureEngineeringStrategies.validate_numeric_features,
    exclude_cols=['<ID Column>'],
    description="Verify all features are numeric before modeling"
).add_step(
    FeatureEngineeringStrategies.validate_no_constants,
    exclude_cols=['<ID Column>'],
    description="Remove constant columns with no predictive value"
).add_step(
    FeatureEngineeringStrategies.select_features_variance,
    columns=[],  # Empty = auto-select all numerical
    threshold=0.01,
    description="Remove low-variance features"
)

# Execute pipeline
# df_complete: complete returns original columns and the engineered features
df_complete = pipeline.execute(your_cleaned_df, verbose=True)

# Shortcut: Get the ID Column with the all needed enigneered features
engineered_features = pipeline.get_engineered_features()
df_id_pure_features = df_complete[['<ID Column>']+engineered_features]

# Get execution log
log_df = pipeline.get_log()

Input

  • A valid dataFrame that would be sent to feature engineering after any data processing, imputation, or drop (A MUST)

Output

  • DataFrame with both original and engineered columns
  • Engineered feature names accessible via
    pipeline.get_engineered_features()
  • Execution log available via
    pipeline.get_log()

Key Features

  • Multiple encoding methods for categorical variables
  • Automatic handling of high-cardinality categoricals
  • Polynomial and interaction feature generation
  • Built-in feature selection for dimensionality reduction
  • Pipeline pattern for reproducible transformations

Best Practices

  • Always validate data types before downstream analysis: Use
    validate_numeric_features()
    after encoding
  • Check for constant columns that provide no information: Use
    validate_no_constants()
    before modeling
  • Convert binary features before other transformations
  • Use one-hot encoding for low-cardinality categoricals
  • Use KNN imputation if missing value could be inferred from other relevant columns
  • Use hash encoding for high-cardinality features (IDs, etc.)
  • Apply variance threshold to remove constant features
  • Check correlation matrix before modeling to avoid multicollinearity
  • MAKE SURE ALL ENGINEERED FEATURES ARE NUMERICAL