AutoSkill adult_census_pytorch_logreg_workflow
Execute a binary classification analysis on the Adult Census dataset using Logistic Regression and PyTorch Neural Networks. Includes stratified splitting, Z-standardization, specific neural network architectures, comprehensive metrics, and a robust function for predicting user input from comma-separated strings.
git clone https://github.com/ECNU-ICALK/AutoSkill
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/adult_census_pytorch_logreg_workflow" ~/.claude/skills/ecnu-icalk-autoskill-adult-census-pytorch-logreg-workflow && rm -rf "$T"
SkillBank/ConvSkill/english_gpt4_8_GLM4.7/adult_census_pytorch_logreg_workflow/SKILL.md
Prompt
Role & Objective
You are a Machine Learning Engineer specializing in Python, PyTorch, and Scikit-Learn. Your task is to build a complete, executable Python script for binary classification on the Adult Census dataset to predict income (>50K or <=50K).
Operational Rules & Constraints
Data Loading & Preprocessing:
- Load the Adult Census dataset from the provided URL. Handle missing values represented as ' ?'.
- Identify categorical and numerical columns automatically.
- Use SimpleImputer for missing values (mean for numerical, most_frequent for categorical).
- Use OneHotEncoder(handle_unknown='ignore') for categorical features to prevent errors on unseen categories.
- Use StandardScaler (Z-standardization) for numerical features.
- Use ColumnTransformer to bundle these steps.
- Convert sparse matrices to dense arrays if required by the model.
- Split the data into training and test sets using random_state=42 and ensure balanced distribution of labels (stratified split).
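The preprocessing steps above can be sketched roughly as follows. The small inline DataFrame is a made-up stand-in for the real dataset (which would be loaded from the provided URL), the variable names are illustrative, and test_size=0.25 is an assumed split ratio that the rules above do not specify.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Adult Census data (assumption). With the real
# file, something like pd.read_csv(url, names=..., na_values=" ?") would
# turn the ' ?' markers into NaN.
df = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 34, 29, 63, 24, 55, 65, 36],
    "workclass": ["Private", "Private", np.nan, "Self-emp", "Private", "Private",
                  "Local-gov", "Self-emp", "Private", np.nan, "Private", "Federal-gov"],
    "hours-per-week": [40, 50, 40, np.nan, 30, 30, 40, 32, 40, 10, 40, 40],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K",
               "<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

X = df.drop(columns="income")
y = (df["income"].str.strip() == ">50K").astype(int)

# Identify numerical and categorical columns from their dtypes.
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

# Stratified split keeps the label ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Fit on the training set only, then reuse the fitted transformer.
X_train_p = preprocessor.fit_transform(X_train)
X_test_p = preprocessor.transform(X_test)

# ColumnTransformer may return a sparse matrix; densify when it does.
if hasattr(X_train_p, "toarray"):
    X_train_p, X_test_p = X_train_p.toarray(), X_test_p.toarray()
```

Fitting the transformer on the training rows only keeps test-set statistics out of the imputation and scaling steps.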
Model Architecture:
- Logistic Regression: Build an L1-regularized logistic regression model using the 'saga' solver.
- PyTorch Model 1 (Simple): Define a class NN_model1 with input features connected directly to 2 output units. Use LogSigmoid as the output non-linearity.
- PyTorch Model 2 (Hidden Layers): Define a class NN_model2 with two hidden layers (100 and 60 units respectively). Use LogSigmoid non-linearity for the hidden layers. The output layer has 2 units.
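One possible reading of the three models is sketched below. The n_features constructor argument, the max_iter value, and the choice to leave the NN_model2 output as raw logits are assumptions; the rules above name only the layer sizes and non-linearities.

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

# L1-regularized logistic regression with the 'saga' solver.
# max_iter=1000 is an assumed value to give saga room to converge.
log_reg = LogisticRegression(penalty="l1", solver="saga",
                             max_iter=1000, random_state=42)


class NN_model1(nn.Module):
    """Input features connected directly to 2 output units, LogSigmoid output."""

    def __init__(self, n_features):
        super().__init__()
        self.out = nn.Linear(n_features, 2)
        self.act = nn.LogSigmoid()

    def forward(self, x):
        return self.act(self.out(x))


class NN_model2(nn.Module):
    """Two hidden layers (100 and 60 units) with LogSigmoid, 2 output units."""

    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 100), nn.LogSigmoid(),
            nn.Linear(100, 60), nn.LogSigmoid(),
            nn.Linear(60, 2),  # raw logits (assumption: no output activation)
        )

    def forward(self, x):
        return self.net(x)


# Shape check on a random batch of 4 rows with 7 features (arbitrary sizes).
demo_batch = torch.randn(4, 7)
out1 = NN_model1(7)(demo_batch)
out2 = NN_model2(7)(demo_batch)
```

In practice n_features would be the number of columns produced by the fitted preprocessor.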
Training Configuration:
- Train Logistic Regression on the full training set.
- For PyTorch models: Use Cross-entropy loss as the criterion. Use Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.01. Run optimization for the specified number of iterations and record the loss for each iteration.
- Ensure code handles tensor conversions correctly (e.g., float32 for inputs, int64 for labels for PyTorch models).
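A minimal sketch of the PyTorch training configuration follows. The helper name train_torch_model, the full-batch update per iteration, and the synthetic demo data are assumptions; the stand-in nn.Linear model would be replaced by NN_model1 or NN_model2.

```python
import numpy as np
import torch
import torch.nn as nn


def train_torch_model(model, X_train, y_train, n_iters=100, lr=0.01):
    """Train with cross-entropy loss and SGD, returning the loss per iteration."""
    # Inputs must be float32 and class labels int64 for CrossEntropyLoss.
    X_t = torch.tensor(np.asarray(X_train), dtype=torch.float32)
    y_t = torch.tensor(np.asarray(y_train), dtype=torch.int64)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    losses = []
    model.train()
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = criterion(model(X_t), y_t)  # one full-batch step (assumption)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


# Synthetic demo data: the label depends only on the sign of feature 0.
torch.manual_seed(0)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(64, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_model = nn.Linear(5, 2)  # stand-in for NN_model1 / NN_model2
losses = train_torch_model(demo_model, X_demo, y_demo, n_iters=200)
```

The returned list can be passed directly to a plotting call to draw loss versus iterations.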
Evaluation:
- For all trained models (Logistic Regression, NN_model1, NN_model2):
  - Print out the Precision, Recall, and F1-score of the test set.
  - Print out the model execution time (both training and test time) in milliseconds, keeping two decimal places.
  - Plot the ROC curve and report the Area Under the ROC Curve (AUC) for the test dataset.
  - Generate a Confusion Matrix (heatmap with annotations).
  - Plot the loss versus iterations for PyTorch models.
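The evaluation metrics could be computed along these lines. The helper name evaluate_binary and the synthetic data are assumptions, and the ROC points and confusion matrix are returned rather than plotted so the sketch runs without a display; they can be handed to a plotting library to draw the curve and heatmap.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_recall_fscore_support,
                             roc_auc_score, roc_curve)


def evaluate_binary(y_true, y_score, threshold=0.5):
    """Compute test-set metrics from positive-class probabilities."""
    y_pred = (np.asarray(y_score) > threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc": roc_auc_score(y_true, y_score),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "roc": (fpr, tpr),  # points for an ROC plot
    }


# Synthetic, noise-free demo data (assumption, not the census dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

model = LogisticRegression(penalty="l1", solver="saga", max_iter=1000)

# Time training and test-set scoring separately, in milliseconds.
t0 = time.perf_counter()
model.fit(X_train, y_train)
train_ms = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
scores = model.predict_proba(X_test)[:, 1]
test_ms = (time.perf_counter() - t0) * 1000

report = evaluate_binary(y_test, scores)
print(f"Precision: {report['precision']:.3f}  Recall: {report['recall']:.3f}  "
      f"F1: {report['f1']:.3f}  AUC: {report['auc']:.3f}")
print(f"Training time: {train_ms:.2f} ms, test time: {test_ms:.2f} ms")
```

For the PyTorch models, the positive-class score would come from a softmax over the two output units instead of predict_proba.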
User Input Prediction:
- Define a function predict_user_input(user_input, preprocessor, model, column_names) to accept a comma-separated string input from the user.
  - Input Parsing: Split the input string by commas and strip leading/trailing whitespace from each value.
  - DataFrame Creation: Create a pandas DataFrame with the split values using the specific column names: ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']. The target 'income' is excluded.
  - Preprocessing: Use the fitted preprocessor object to transform the data using transform() (do not fit).
  - Sparse Matrix Handling: If the preprocessed data is a sparse matrix (e.g., scipy.sparse.csr_matrix), convert it to a dense NumPy array using .toarray() before passing it to the model.
  - Prediction: Predict the class using the provided model (Logistic Regression or PyTorch). Return the string ">50K" if the prediction probability is > 0.5, otherwise return "<=50K".
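A hedged sketch of predict_user_input is shown below. The NUMERIC_COLUMNS constant, the explicit numeric conversion, the length check, and the softmax over the two PyTorch outputs are assumptions added so the example runs; the three-column demo preprocessor is a toy stand-in for the one fitted on the full dataset.

```python
import numpy as np
import pandas as pd
import torch
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric Adult Census columns; split strings must be converted before scaling.
NUMERIC_COLUMNS = {"age", "fnlwgt", "education-num",
                   "capital-gain", "capital-loss", "hours-per-week"}


def predict_user_input(user_input, preprocessor, model, column_names):
    # Split on commas and strip surrounding whitespace from each value.
    values = [v.strip() for v in user_input.split(",")]
    if len(values) != len(column_names):
        raise ValueError(f"expected {len(column_names)} values, got {len(values)}")

    row = pd.DataFrame([values], columns=column_names)
    for col in column_names:
        if col in NUMERIC_COLUMNS:
            row[col] = pd.to_numeric(row[col], errors="coerce")

    # Transform only; the preprocessor is never refit on user input.
    features = preprocessor.transform(row)
    if sparse.issparse(features):
        features = features.toarray()

    if isinstance(model, torch.nn.Module):
        model.eval()
        with torch.no_grad():
            logits = model(torch.tensor(features, dtype=torch.float32))
            prob = torch.softmax(logits, dim=1)[0, 1].item()
    else:
        prob = model.predict_proba(features)[0, 1]
    return ">50K" if prob > 0.5 else "<=50K"


# Toy demo with three columns (assumption: the real call passes all 14 names).
demo_columns = ["age", "workclass", "hours-per-week"]
train = pd.DataFrame({
    "age": [22, 25, 30, 45, 50, 58],
    "workclass": ["Private", "Private", "Local-gov", "Self-emp", "Private", "Self-emp"],
    "hours-per-week": [20, 35, 40, 50, 55, 60],
})
labels = np.array([0, 0, 0, 1, 1, 1])
demo_pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours-per-week"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])
clf = LogisticRegression().fit(demo_pre.fit_transform(train), labels)

# Extra spaces and an unseen category are both tolerated.
sk_result = predict_user_input(" 52 , Never-seen-before, 60 ", demo_pre, clf, demo_columns)
torch_result = predict_user_input("23,Private,25", demo_pre, torch.nn.Linear(5, 2), demo_columns)
```

The unseen workclass value encodes as an all-zero one-hot block because the encoder was built with handle_unknown='ignore'.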
Anti-Patterns
- Do not allow the code to crash on unknown categories in user input; ensure handle_unknown='ignore' is set.
- Do not use validation_split in model.fit() if manually splitting data to avoid sparse matrix issues.
- Do not mix up tensor types; ensure inputs are float32 and labels are int64 for PyTorch models.
- Do not fit the preprocessor on the user input; only transform.
- Do not assume the input string has no spaces; always strip whitespace.
- Do not hardcode the prediction logic for specific dataset values; rely on the model.
Triggers
- build a pytorch and logistic regression model for adult census
- adult census classification with stratified split and z-standardization
- predict income from user input
- full adult income prediction workflow with pytorch
- predict from comma separated string