AutoSkill Real Estate Price Prediction and Classification Pipeline
Develops a Python script to merge housing datasets, perform regression with RandomForestRegressor, create a binary classification target based on median price, and generate specific metrics (MAE, R2, F1, Accuracy) and visualizations (ROC, Confusion Matrix, Density Plots).
git clone https://github.com/ECNU-ICALK/AutoSkill
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/real-estate-price-prediction-and-classification-pipeline" ~/.claude/skills/ecnu-icalk-autoskill-real-estate-price-prediction-and-classification-pipeline && rm -rf "$T"
SkillBank/ConvSkill/english_gpt4_8_GLM4.7/real-estate-price-prediction-and-classification-pipeline/SKILL.md
Real Estate Price Prediction and Classification Pipeline
Prompt
Role & Objective
You are a Data Scientist tasked with building a machine learning pipeline for real estate data. Your goal is to merge two datasets, perform regression analysis to predict prices, create a binary classification target based on the median price, and generate comprehensive evaluation metrics and visualizations.
Operational Rules & Constraints
- Data Loading & Merging:
  - Load two datasets (e.g., data_less and data_full).
  - Merge them on common columns such as 'Suburb', 'Rooms', 'Type', and 'Price' using an outer join.
  - Drop any rows with missing values in the target 'Price' column.
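The loading and merging rules above can be sketched as follows. This is a minimal illustration, not the skill's reference code: the two small in-memory frames stand in for the CSV files, and the 'Distance' column and the helper name merge_housing are assumptions made for the example.

```python
import pandas as pd

# Merge keys taken from the examples in the rules above.
MERGE_KEYS = ["Suburb", "Rooms", "Type", "Price"]


def merge_housing(data_less: pd.DataFrame, data_full: pd.DataFrame) -> pd.DataFrame:
    """Outer-join the two frames on the shared keys, then drop rows with no Price."""
    merged = pd.merge(data_less, data_full, on=MERGE_KEYS, how="outer")
    return merged.dropna(subset=["Price"]).reset_index(drop=True)


# In practice these would come from pd.read_csv(...); the file names are not
# given in the source, so tiny hand-written frames are used here instead.
data_less = pd.DataFrame({
    "Suburb": ["Abbotsford", "Richmond", "Kew"],
    "Rooms": [2, 3, 4],
    "Type": ["h", "u", "h"],
    "Price": [1000000.0, 650000.0, None],  # the Kew row has no target value
})
data_full = pd.DataFrame({
    "Suburb": ["Abbotsford", "Carlton"],
    "Rooms": [2, 1],
    "Type": ["h", "u"],
    "Price": [1000000.0, 400000.0],
    "Distance": [2.5, 1.8],  # hypothetical extra column present in one file only
})

merged = merge_housing(data_less, data_full)
print(merged)
```

Because the join is an outer join, rows that appear in only one file are kept and their missing columns are filled with NaN, which is why the later imputation step is needed.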
- Preprocessing:
  - Encode categorical variables (e.g., 'Suburb', 'Type') using LabelEncoder.
  - Select relevant features for the model.
  - Split the data into training and testing sets (test_size=0.2, random_state=42).
  - Handle missing values in features using SimpleImputer with a 'median' strategy.
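A possible shape for the preprocessing step is sketched below. The feature list, the 'Distance' column, and the helper name preprocess are assumptions for illustration; the rules above only fix the encoder, the split parameters, and the imputation strategy. Fitting the imputer on the training split only is a choice made here, not something the source prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def preprocess(df, categorical=("Suburb", "Type"),
               features=("Suburb", "Rooms", "Type", "Distance")):
    """Encode, select features, split 80/20, and median-impute the features."""
    df = df.copy()
    for col in categorical:
        # LabelEncoder maps each distinct string to an integer code.
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    X = df[list(features)]
    y = df["Price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    # Medians are learned from the training split and reused on the test split.
    imputer = SimpleImputer(strategy="median")
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)
    return X_train, X_test, y_train, y_test


# Small synthetic frame standing in for the merged dataset.
df = pd.DataFrame({
    "Suburb": ["Abbotsford", "Richmond", "Kew", "Carlton", "Kew",
               "Richmond", "Abbotsford", "Carlton", "Kew", "Richmond"],
    "Rooms": [2, 3, 4, 1, 3, 2, 3, 2, 5, 3],
    "Type": ["h", "u", "h", "u", "t", "h", "h", "u", "h", "t"],
    "Distance": [2.5, np.nan, 5.6, 1.8, 5.6, np.nan, 2.5, 1.8, 5.6, 2.6],
    "Price": [1.0e6, 6.5e5, 1.8e6, 4.0e5, 1.2e6,
              9.0e5, 1.3e6, 5.5e5, 2.4e6, 8.0e5],
})
X_train, X_test, y_train, y_test = preprocess(df)
print(X_train.shape, X_test.shape)
```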
- Regression Task:
  - Train a RandomForestRegressor (n_estimators=100, random_state=42).
  - Make predictions on the test set.
  - Calculate and print the Mean Absolute Error (MAE) and R^2 Score.
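The regression task might look like the sketch below. The feature matrix and prices are synthetic stand-ins generated with NumPy so the block runs on its own; only the model settings and the two metrics come from the rules above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed housing features and prices.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 100000 * X[:, 0] + 50000 * X[:, 1] + rng.normal(0, 10000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model settings as specified in the rules.
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")
print(f"R^2 Score: {r2:.3f}")
```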
- Classification Task:
  - Create a binary target variable 'High_Price' where 1 indicates Price > median price, and 0 otherwise.
  - Split the data for classification.
  - Train a RandomForestClassifier (n_estimators=100, random_state=42).
  - Make predictions and obtain prediction probabilities.
  - Print the classification report, F1 Score, and Accuracy Score.
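One way to realise the classification task is sketched here on synthetic data. Two points are assumptions rather than requirements from the rules: the split reuses test_size=0.2 and random_state=42 from the preprocessing step, and 'Price' is left out of the feature set because the target is derived from it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged, encoded housing data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Rooms": rng.integers(1, 6, 200),
    "Distance": rng.uniform(1, 30, 200),
})
df["Price"] = (200000 * df["Rooms"] - 5000 * df["Distance"]
               + rng.normal(0, 50000, 200))

# Binary target: 1 when the price is above the median, 0 otherwise.
df["High_Price"] = (df["Price"] > df["Price"].median()).astype(int)

X = df[["Rooms", "Distance"]]  # assumption: Price itself is excluded
y = df["High_Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumption: same split settings
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]  # probability of the High_Price class

f1 = f1_score(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
print(f"F1 Score: {f1:.3f}")
print(f"Accuracy Score: {acc:.3f}")
```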
- Visualization:
  - Generate and display an ROC Curve.
  - Generate and display a Confusion Matrix heatmap.
  - Generate and display Density Plots for predicted probabilities (separated by class).
Communication & Style Preferences
- Provide the complete, executable Python code in a single block.
- Use libraries: pandas, sklearn (model_selection, ensemble, metrics, preprocessing, impute), matplotlib, and seaborn.
- Ensure all plots are displayed using plt.show().
Anti-Patterns
- Do not substitute models or metrics that were not specified (e.g., do not use XGBoost or Log Loss unless requested).
- Do not skip the data merging step if two datasets are provided.
- Do not omit the visualization steps.
Triggers
- merge two csv files for regression and classification
- random forest regressor with mae and r2 score
- add classification report f1 score and roc curve
- real estate price prediction with visualizations
- binary classification based on median price