AutoSkill Genetic Algorithm Feature Selection and Comprehensive Model Evaluation
Implements a Genetic Algorithm (GA) using DEAP to select optimal features for a classification model (e.g., Breast Cancer Wisconsin), trains a Random Forest Classifier, and generates a comprehensive set of evaluation visualizations including Confusion Matrix, ROC Curves (binary and multi-class), Density Plots, and Predicted vs Actual distributions.
git clone https://github.com/ECNU-ICALK/AutoSkill
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/genetic-algorithm-feature-selection-and-comprehensive-model-eval" ~/.claude/skills/ecnu-icalk-autoskill-genetic-algorithm-feature-selection-and-comprehensive-model && rm -rf "$T"
SkillBank/ConvSkill/english_gpt4_8/genetic-algorithm-feature-selection-and-comprehensive-model-eval/SKILL.md
Genetic Algorithm Feature Selection and Comprehensive Model Evaluation
Prompt
Role & Objective
You are an expert Machine Learning Engineer. Your task is to implement a Python script that performs feature selection using a Genetic Algorithm (GA), trains a classification model on the selected features, and generates a comprehensive set of evaluation visualizations.
Communication & Style Preferences
- Provide the complete, executable Python code in a single block.
- Use clear comments to explain the GA setup, data preprocessing, and plotting sections.
- Ensure the code handles both binary and multi-class classification scenarios for ROC and density plots as requested.
Operational Rules & Constraints
- Data Preprocessing:
  - Load the dataset from a CSV file (placeholder path).
  - Drop the 'id' column and any columns containing only NaN values.
  - Encode the target variable (e.g., 'diagnosis': M -> 1, B -> 0).
  - Split the data into training and testing sets BEFORE imputation.
  - Use `SimpleImputer` (strategy='mean') to handle missing values in the feature set.
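The preprocessing order above matters because fitting the imputer after the split keeps test-set statistics out of the training data. A minimal sketch of these steps, assuming a pandas DataFrame is already loaded; the function name `preprocess`, the 80/20 split, and the stratified split are illustrative assumptions, not part of the original prompt:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split


def preprocess(df, target_col="diagnosis", id_col="id", test_size=0.2, seed=42):
    """Clean the frame, split it, then impute with training-set means only."""
    df = df.drop(columns=[id_col], errors="ignore")  # drop the identifier if present
    df = df.dropna(axis=1, how="all")                # drop columns that are entirely NaN
    y = df[target_col].map({"M": 1, "B": 0})         # encode the target
    X = df.drop(columns=[target_col])

    # Split BEFORE imputation (test_size and stratify are assumptions).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

    # Fit the imputer on the training split only, then reuse it on the test split.
    # Assumes every feature has at least one observed value in the training split.
    imputer = SimpleImputer(strategy="mean")
    X_train = pd.DataFrame(
        imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
    )
    X_test = pd.DataFrame(
        imputer.transform(X_test), columns=X.columns, index=X_test.index
    )
    return X_train, X_test, y_train, y_test
```

A call such as `preprocess(pd.read_csv(DATA_PATH))` would return the four splits, where `DATA_PATH` stands in for the placeholder CSV path.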
- Genetic Algorithm (GA) Setup:
  - Use the `deap` library for the GA implementation.
  - Define an individual as a binary list representing feature selection (1 = include, 0 = exclude).
  - Define the fitness function to maximize accuracy using a `RandomForestClassifier` (random_state=42).
  - Use `tools.cxTwoPoint` for crossover, `tools.mutFlipBit` for mutation, and `tools.selTournament` for selection.
  - Run the evolution for a specified number of generations (e.g., 40) and population size (e.g., 50).
  - Extract the best individual (selected features) after evolution.
- Model Training:
  - Train a final `RandomForestClassifier` on the training set using ONLY the best features selected by the GA.
  - Make predictions on the test set.
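The training step above amounts to turning the GA's binary mask into column indices and applying the same indices to both splits. A minimal sketch, assuming the mask is a list of 0/1 values aligned with the feature columns; the helper name `train_on_selected` and its return values are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_on_selected(X_train, y_train, X_test, mask, seed=42):
    """Fit a Random Forest on the masked columns and predict the test set."""
    keep = np.flatnonzero(np.asarray(mask) == 1)  # indices of selected features
    if keep.size == 0:
        raise ValueError("the GA mask selects no features")

    X_train, X_test = np.asarray(X_train), np.asarray(X_test)
    model = RandomForestClassifier(random_state=seed)
    model.fit(X_train[:, keep], y_train)

    y_pred = model.predict(X_test[:, keep])         # hard labels for the reports
    y_proba = model.predict_proba(X_test[:, keep])  # probabilities for ROC and KDE plots
    return model, keep, y_pred, y_proba
```

The columns of `y_proba` follow the order of `model.classes_`, which is what the ROC and density plots need later.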
- Evaluation & Visualization:
  - Print the Classification Report, Precision Score, F1 Score, and Accuracy Score.
  - Confusion Matrix: Generate a heatmap using `seaborn`.
  - ROC Curve:
    - For binary classification: Plot the standard ROC curve with AUC.
    - For multi-class classification: Use One-vs-Rest strategy to plot ROC curves for each class and a macro-average AUC.
  - Density Plots: Plot the Kernel Density Estimate (KDE) of predicted probabilities for each class.
  - Predicted vs Actual: Plot the distribution of actual vs. predicted labels.
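The binary versus multi-class ROC rule above could be handled by branching on the number of classes. The sketch below is one possible approach, not the only one: it assumes the columns of `y_proba` follow the order of `classes` (as with `model.classes_`), and it computes the macro-average by interpolating each One-vs-Rest curve onto a shared false-positive-rate grid. The names `roc_data` and `plot_roc` are hypothetical:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize


def roc_data(y_true, y_proba, classes):
    """Return {label: (fpr, tpr, auc)}; multi-class input adds a 'macro' entry."""
    y_proba = np.asarray(y_proba)
    curves = {}
    if len(classes) == 2:
        # Binary: score with the probability of the positive class, classes[1].
        fpr, tpr, _ = roc_curve(y_true, y_proba[:, 1], pos_label=classes[1])
        curves[classes[1]] = (fpr, tpr, auc(fpr, tpr))
        return curves

    # Multi-class: One-vs-Rest, one binary indicator column per class.
    y_bin = label_binarize(y_true, classes=classes)
    for k, label in enumerate(classes):
        fpr, tpr, _ = roc_curve(y_bin[:, k], y_proba[:, k])
        curves[label] = (fpr, tpr, auc(fpr, tpr))

    # Macro-average: interpolate each curve on a common FPR grid, then average.
    grid = np.linspace(0.0, 1.0, 101)
    mean_tpr = np.mean(
        [np.interp(grid, fpr, tpr) for fpr, tpr, _ in curves.values()], axis=0
    )
    curves["macro"] = (grid, mean_tpr, auc(grid, mean_tpr))
    return curves


def plot_roc(curves, ax=None):
    """Draw every curve from roc_data on one set of axes."""
    if ax is None:
        ax = plt.gca()
    for label, (fpr, tpr, area) in curves.items():
        ax.plot(fpr, tpr, label=f"{label} (AUC = {area:.2f})")
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.legend(loc="lower right")
    return ax
```

With a fitted model, `plot_roc(roc_data(y_test, y_proba, list(model.classes_)))` followed by `plt.show()` covers both cases without a separate code path in the caller.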
Anti-Patterns
- Do not hardcode specific file paths or dataset column names beyond the standard 'diagnosis' and 'id' for the Breast Cancer dataset context; use variables or clear placeholders.
- Do not skip the multi-class plotting logic; ensure the code detects the number of classes and adapts the ROC/Density plots accordingly.
- Do not use deprecated parameters (e.g., `shade=True` in kdeplot is deprecated; use `fill=True`).
Interaction Workflow
- Load and preprocess the data.
- Execute the Genetic Algorithm to find the best features.
- Train the Random Forest Classifier on the selected features.
- Evaluate the model and generate all requested plots sequentially.
Triggers
- use genetic algorithm for feature selection
- generate comprehensive model evaluation plots
- breast cancer diagnosis model with visualizations
- random forest feature selection with deap
- plot roc curve confusion matrix and density plots