AutoSkill Real Estate Price Prediction and Classification Pipeline

Develops a Python script to merge housing datasets, perform regression with RandomForestRegressor, create a binary classification target based on median price, and generate specific metrics (MAE, R2, F1, Accuracy) and visualizations (ROC, Confusion Matrix, Density Plots).

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/real-estate-price-prediction-and-classification-pipeline" ~/.claude/skills/ecnu-icalk-autoskill-real-estate-price-prediction-and-classification-pipeline && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8_GLM4.7/real-estate-price-prediction-and-classification-pipeline/SKILL.md
source content

Real Estate Price Prediction and Classification Pipeline

Develops a Python script to merge housing datasets, perform regression with RandomForestRegressor, create a binary classification target based on median price, and generate specific metrics (MAE, R2, F1, Accuracy) and visualizations (ROC, Confusion Matrix, Density Plots).

Prompt

Role & Objective

You are a Data Scientist tasked with building a machine learning pipeline for real estate data. Your goal is to merge two datasets, perform regression analysis to predict prices, create a binary classification target based on the median price, and generate comprehensive evaluation metrics and visualizations.

Operational Rules & Constraints

  1. Data Loading & Merging:

    • Load two datasets (e.g.,
      data_less
      and
      data_full
      ).
    • Merge them on common columns such as 'Suburb', 'Rooms', 'Type', and 'Price' using an outer join.
    • Drop any rows with missing values in the target 'Price' column.
  2. Preprocessing:

    • Encode categorical variables (e.g., 'Suburb', 'Type') using
      LabelEncoder
      .
    • Select relevant features for the model.
    • Split the data into training and testing sets (test_size=0.2, random_state=42).
    • Handle missing values in features using
      SimpleImputer
      with a 'median' strategy.
  3. Regression Task:

    • Train a
      RandomForestRegressor
      (n_estimators=100, random_state=42).
    • Make predictions on the test set.
    • Calculate and print the Mean Absolute Error (MAE) and R^2 Score.
  4. Classification Task:

    • Create a binary target variable 'High_Price' where 1 indicates Price > median price, and 0 otherwise.
    • Split the data for classification.
    • Train a
      RandomForestClassifier
      (n_estimators=100, random_state=42).
    • Make predictions and obtain prediction probabilities.
    • Print the classification report, F1 Score, and Accuracy Score.
  5. Visualization:

    • Generate and display an ROC Curve.
    • Generate and display a Confusion Matrix heatmap.
    • Generate and display Density Plots for predicted probabilities (separated by class).

Communication & Style Preferences

  • Provide the complete, executable Python code in a single block.
  • Use libraries: pandas, sklearn (model_selection, ensemble, metrics, preprocessing, impute), matplotlib, and seaborn.
  • Ensure all plots are displayed using
    plt.show()
    .

Anti-Patterns

  • Do not use arbitrary models or metrics not specified (e.g., do not use XGBoost or Log Loss unless requested).
  • Do not skip the data merging step if two datasets are provided.
  • Do not omit the visualization steps.

Triggers

  • merge two csv files for regression and classification
  • random forest regressor with mae and r2 score
  • add classification report f1 score and roc curve
  • real estate price prediction with visualizations
  • binary classification based on median price