Skillsbench feature_engineering
Engineer dataset features before ML or Causal Inference. Methods include encoding categorical variables, scaling numerics, creating interactions, and selecting relevant features.
install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/trend-anomaly-causal-inference/environment/skills/feature_engineering" ~/.claude/skills/benchflow-ai-skillsbench-feature-engineering && rm -rf "$T"
manifest:
tasks/trend-anomaly-causal-inference/environment/skills/feature_engineering/SKILL.md
Feature Engineering Framework
Comprehensive, modular feature engineering framework for general tabular datasets. Provides strategy-based operations including numerical scaling, categorical encoding, polynomial features, and feature selection through a configurable pipeline.
Core Components
FeatureEngineeringStrategies
Collection of static methods for feature engineering operations:
Numerical Features (if interpretability is not a concern)
- scale_numerical(df, columns, method): Scale using 'standard', 'minmax', or 'robust'
- create_bins(df, columns, n_bins, strategy): Discretize using 'uniform', 'quantile', or 'kmeans'
- create_polynomial_features(df, columns, degree): Generate polynomial and interaction terms
- create_interaction_features(df, column_pairs): Create multiplicative interactions
- create_log_features(df, columns): Log-transform skewed distributions
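As a rough sketch of what the 'standard' scaling and log strategies compute, using plain pandas/NumPy (the framework's internals may differ; the column name is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30_000.0, 45_000.0, 120_000.0, 60_000.0]})

# 'standard' scaling: subtract the mean, divide by the (population) std
mu, sigma = df["income"].mean(), df["income"].std(ddof=0)
df["income_scaled"] = (df["income"] - mu) / sigma

# log transform for right-skewed values; log1p is safe at zero
df["income_log"] = np.log1p(df["income"])
```

The scaled column has zero mean and unit variance, which keeps downstream models from being dominated by the raw income magnitudes.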
Categorical Features
- encode_categorical(df, columns, method): Encode using 'onehot', 'label', 'frequency', or 'hash'
- create_category_aggregations(df, categorical_col, numerical_cols, agg_funcs): Compute group-level statistics
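A minimal pandas sketch of 'frequency' encoding and a group-level aggregation, with toy column names (illustrative, not the framework's actual implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "A", "A", "C"],
    "spend": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# 'frequency' encoding: replace each category with its relative frequency
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# group-level aggregation: attach the per-category mean of a numeric column
df["city_spend_mean"] = df.groupby("city")["spend"].transform("mean")
```

Both derived columns are numeric, so they can feed directly into modeling without further encoding.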
Binary Features
- convert_to_binary(df, columns): Convert Yes/No and True/False values to 0/1 (int dtype)
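A plain-pandas sketch of the binary conversion, assuming a small mapping of common Yes/No and True/False values (the framework may recognize more variants):

```python
import pandas as pd

df = pd.DataFrame({"smoker": ["Yes", "No", "Yes"], "active": [True, False, True]})

# map common binary encodings to 0/1 and force an integer dtype
mapping = {"Yes": 1, "No": 0, True: 1, False: 0}
for col in ["smoker", "active"]:
    df[col] = df[col].map(mapping).astype(int)
```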
Data Quality Validation
- validate_numeric_features(df, exclude_cols): Verify all features are numeric (except ID columns)
- validate_no_constants(df, exclude_cols): Remove constant columns with no variance
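The two validation checks can be sketched with plain pandas (illustrative only; column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "age": [25, 40, 31],
    "flag": [1, 1, 1],          # constant column: no variance
})

exclude = ["id"]
feature_cols = [c for c in df.columns if c not in exclude]

# fail fast if any feature column is still non-numeric
non_numeric = df[feature_cols].select_dtypes(exclude="number").columns.tolist()
assert not non_numeric, f"non-numeric features remain: {non_numeric}"

# drop columns with a single unique value
constants = [c for c in feature_cols if df[c].nunique() <= 1]
df = df.drop(columns=constants)
```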
Feature Selection
- select_features_variance(df, columns, threshold): Remove low-variance features (default threshold: 0.01). Columns whose values are almost all identical carry little information, so dropping them reduces dimensionality.
- select_features_correlation(df, columns, threshold): Remove highly correlated features
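An illustrative pandas version of both selection steps, assuming a 0.01 variance cutoff and a 0.95 correlation threshold (the framework's tie-breaking rules may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with x1
    "x3": [0.0, 0.0, 0.0, 0.01],   # near-constant
})

# variance threshold: drop features whose variance falls below the cutoff
low_var = [c for c in df.columns if df[c].var() < 0.01]

# correlation filter: for each highly correlated pair, drop the later column
corr = df.drop(columns=low_var).corr().abs()
drop_corr = [
    c for i, c in enumerate(corr.columns)
    if any(corr.iloc[j, i] > 0.95 for j in range(i))
]
selected = [c for c in df.columns if c not in low_var + drop_corr]
```

Here x3 falls to the variance filter and x2 to the correlation filter, leaving only x1.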
FeatureEngineeringPipeline
Orchestrates multiple feature engineering steps with logging.
CRITICAL REQUIREMENTS:
- ALL output features MUST be numeric (int or float); DID analysis cannot use string/object columns
- Preview data types BEFORE processing: use df.dtypes and df.head() to check actual values
- Encode ALL categorical variables: strings like "degree" and "age_range" must be converted to numbers
- Verify output: the final dataframe should satisfy df.select_dtypes(include='number').shape[1] == df.shape[1] - 1 (excluding the ID column)
Usage Example
```python
from feature_engineering import FeatureEngineeringStrategies, FeatureEngineeringPipeline

# Create pipeline
pipeline = FeatureEngineeringPipeline(name="Demographics")

# Add feature engineering steps
pipeline.add_step(
    FeatureEngineeringStrategies.convert_to_binary,
    columns=['<column5>', '<column2>'],
    description="Convert binary survey responses to 0/1"
).add_step(
    FeatureEngineeringStrategies.encode_categorical,
    columns=['<column3>', '<column7>'],
    method='onehot',
    description="One-hot encode categorical features"
).add_step(
    FeatureEngineeringStrategies.scale_numerical,
    columns=['<column10>', '<column1>'],
    method='standard',
    description="Standardize numerical features"
).add_step(
    FeatureEngineeringStrategies.validate_numeric_features,
    exclude_cols=['<ID Column>'],
    description="Verify all features are numeric before modeling"
).add_step(
    FeatureEngineeringStrategies.validate_no_constants,
    exclude_cols=['<ID Column>'],
    description="Remove constant columns with no predictive value"
).add_step(
    FeatureEngineeringStrategies.select_features_variance,
    columns=[],  # Empty = auto-select all numerical
    threshold=0.01,
    description="Remove low-variance features"
)

# Execute pipeline
# df_complete contains the original columns plus the engineered features
df_complete = pipeline.execute(your_cleaned_df, verbose=True)

# Shortcut: keep only the ID column and the engineered features
engineered_features = pipeline.get_engineered_features()
df_id_pure_features = df_complete[['<ID Column>'] + engineered_features]

# Get execution log
log_df = pipeline.get_log()
```
Input
- A valid DataFrame, passed to feature engineering only after any required data processing, imputation, or column drops (this preprocessing is a MUST)
Output
- DataFrame with both original and engineered columns
- Engineered feature names accessible via pipeline.get_engineered_features()
- Execution log available via pipeline.get_log()
Key Features
- Multiple encoding methods for categorical variables
- Automatic handling of high-cardinality categoricals
- Polynomial and interaction feature generation
- Built-in feature selection for dimensionality reduction
- Pipeline pattern for reproducible transformations
Best Practices
- Always validate data types before downstream analysis: use validate_numeric_features() after encoding
- Check for constant columns that provide no information: use validate_no_constants() before modeling
- Convert binary features before other transformations
- Use one-hot encoding for low-cardinality categoricals
- Use KNN imputation when missing values can be inferred from other relevant columns
- Use hash encoding for high-cardinality features (IDs, etc.)
- Apply variance threshold to remove constant features
- Check correlation matrix before modeling to avoid multicollinearity
- MAKE SURE ALL ENGINEERED FEATURES ARE NUMERICAL
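For the hash-encoding advice above, a deterministic sketch using hashlib (Python's built-in hash() is salted per process, so a stable digest is preferable; the bucket count of 16 is an arbitrary choice):

```python
import hashlib
import pandas as pd

def hash_encode(series: pd.Series, n_buckets: int = 16) -> pd.Series:
    """Map high-cardinality strings to a fixed number of integer buckets."""
    return series.map(
        lambda v: int(hashlib.md5(str(v).encode()).hexdigest(), 16) % n_buckets
    )

df = pd.DataFrame({"device_id": ["a1", "b2", "c3", "a1"]})
df["device_bucket"] = hash_encode(df["device_id"])
```

Identical raw values always land in the same bucket, and the output stays numeric regardless of how many distinct IDs appear.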