AutoSkill Real Estate Price Prediction and Classification Pipeline

Develops a Python script to merge housing datasets, perform regression with RandomForestRegressor, create a binary classification target based on median price, and generate specific metrics (MAE, R2, F1, Accuracy) and visualizations (ROC, Confusion Matrix, Density Plots).

install

source · Clone the upstream repo

git clone https://github.com/ECNU-ICALK/AutoSkill

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/real-estate-price-prediction-and-classification-pipeline" ~/.claude/skills/ecnu-icalk-autoskill-real-estate-price-prediction-and-classification-pipeline && rm -rf "$T"

manifest: SkillBank/ConvSkill/english_gpt4_8_GLM4.7/real-estate-price-prediction-and-classification-pipeline/SKILL.md

source content

Real Estate Price Prediction and Classification Pipeline

Prompt

Role & Objective

You are a Data Scientist tasked with building a machine learning pipeline for real estate data. Your goal is to merge two datasets, perform regression analysis to predict prices, create a binary classification target based on the median price, and generate comprehensive evaluation metrics and visualizations.

Operational Rules & Constraints

Data Loading & Merging:
- Load two datasets (e.g., `data_less` and `data_full`).
- Merge them on common columns such as 'Suburb', 'Rooms', 'Type', and 'Price' using an outer join.
- Drop any rows with missing values in the target 'Price' column.
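
The merge rules above can be sketched as follows. The two frames here are tiny hypothetical stand-ins (the suburb names, the `Distance` column, and all values are invented for illustration); in practice they would be loaded with `pd.read_csv` from whatever files are provided.

```python
import pandas as pd

# Hypothetical stand-in data; real frames would come from pd.read_csv(...).
data_less = pd.DataFrame({
    "Suburb": ["Abbotsford", "Richmond", "Carlton"],
    "Rooms": [2, 3, 2],
    "Type": ["h", "u", "h"],
    "Price": [1_000_000.0, 650_000.0, None],  # one listing has no price
})
data_full = pd.DataFrame({
    "Suburb": ["Abbotsford", "Fitzroy"],
    "Rooms": [2, 4],
    "Type": ["h", "t"],
    "Price": [1_000_000.0, 1_400_000.0],
    "Distance": [2.5, 1.6],  # extra column only present in the fuller dataset
})

# Outer join on the shared columns keeps rows that appear in either frame.
merge_keys = ["Suburb", "Rooms", "Type", "Price"]
merged = pd.merge(data_less, data_full, on=merge_keys, how="outer")

# Rows without a target value cannot be used for training or evaluation.
merged = merged.dropna(subset=["Price"]).reset_index(drop=True)
print(merged)
```

Because the join is an outer join, rows found in only one dataset survive with `NaN` in the columns they lack, which is why the later imputation step is needed.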
Preprocessing:
- Encode categorical variables (e.g., 'Suburb', 'Type') using `LabelEncoder`.
- Select relevant features for the model.
- Split the data into training and testing sets (test_size=0.2, random_state=42).
- Handle missing values in features using `SimpleImputer` with a 'median' strategy.
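
A minimal sketch of the preprocessing steps, assuming a merged frame with one numeric feature that contains gaps. The `Distance` column, the feature list, and all values are hypothetical. Fitting the imputer on the training split only is a choice made here to keep test data out of the fitted median, not something the rules above require.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical merged frame; 'Distance' has missing values left by the outer join.
merged = pd.DataFrame({
    "Suburb": ["A", "B", "A", "C", "B", "C", "A", "B", "C", "A"],
    "Type": ["h", "u", "h", "t", "u", "h", "t", "h", "u", "h"],
    "Rooms": [2, 3, 2, 4, 1, 3, 2, 3, 4, 2],
    "Distance": [2.5, np.nan, 3.0, 8.1, 5.0, np.nan, 2.2, 4.4, 9.0, 2.8],
    "Price": [1.0e6, 6.5e5, 9.8e5, 1.4e6, 4.0e5, 1.1e6, 8.0e5, 9.0e5, 1.2e6, 1.05e6],
})

# Encode each categorical column as integer codes.
for col in ["Suburb", "Type"]:
    merged[col] = LabelEncoder().fit_transform(merged[col].astype(str))

# Select features (an illustrative choice) and the regression target.
features = ["Suburb", "Type", "Rooms", "Distance"]
X, y = merged[features], merged["Price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fill remaining gaps with the per-column median learned from the training split.
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
```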
Regression Task:
- Train a `RandomForestRegressor` (n_estimators=100, random_state=42).
- Make predictions on the test set.
- Calculate and print the Mean Absolute Error (MAE) and R^2 Score.
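
The regression task might look like the sketch below. The data is synthetic (a made-up linear price formula plus noise) so that the block runs on its own; with real data, `X` and `y` would be the encoded, imputed features and the 'Price' column.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and prices (assumption, for illustration only).
rng = np.random.default_rng(0)
rooms = rng.integers(1, 6, size=200)
distance = rng.uniform(1, 30, size=200)
X = np.column_stack([rooms, distance])
y = 300_000 + 150_000 * rooms - 8_000 * distance + rng.normal(0, 20_000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: {mae:,.0f}")
print(f"R^2 Score: {r2:.3f}")
```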
Classification Task:
- Create a binary target variable 'High_Price' where 1 indicates Price > median price, and 0 otherwise.
- Split the data for classification.
- Train a `RandomForestClassifier` (n_estimators=100, random_state=42).
- Make predictions and obtain prediction probabilities.
- Print the classification report, F1 Score, and Accuracy Score.
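
A minimal sketch of the classification task on synthetic data. Leaving 'Price' out of the feature matrix is an assumption made here, since 'High_Price' is derived directly from it; reusing test_size=0.2 and random_state=42 for the split is likewise an assumption carried over from the preprocessing rules.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, for illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=200),
    "Distance": rng.uniform(1, 30, size=200),
})
df["Price"] = (
    300_000 + 150_000 * df["Rooms"] - 8_000 * df["Distance"]
    + rng.normal(0, 20_000, size=200)
)

# 1 if the price is strictly above the median, 0 otherwise.
df["High_Price"] = (df["Price"] > df["Price"].median()).astype(int)

X = df[["Rooms", "Distance"]]  # 'Price' excluded: the target is computed from it
y = df["High_Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_proba = classifier.predict_proba(X_test)[:, 1]  # probability of class 1

print(classification_report(y_test, y_pred))
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
print(f"Accuracy Score: {accuracy_score(y_test, y_pred):.3f}")
```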
Visualization:
- Generate and display an ROC Curve.
- Generate and display a Confusion Matrix heatmap.
- Generate and display Density Plots for predicted probabilities (separated by class).

Communication & Style Preferences

Provide the complete, executable Python code in a single block.
Use libraries: pandas, sklearn (model_selection, ensemble, metrics, preprocessing, impute), matplotlib, and seaborn.
Ensure all plots are displayed using `plt.show()`.

Anti-Patterns

Do not use models or metrics that were not specified (e.g., do not use XGBoost or Log Loss unless requested).
Do not skip the data merging step if two datasets are provided.
Do not omit the visualization steps.

Triggers

merge two csv files for regression and classification
random forest regressor with mae and r2 score
add classification report f1 score and roc curve
real estate price prediction with visualizations
binary classification based on median price