STAT 434 — Statistical Learning: Methods and Applications

Predicting Housing Prices with Statistical Learning

Comparing regularized regression, tree-based methods, and ensemble models on the Ames Housing dataset — with rigorous cross-validation and interpretability analysis via SHAP values.

Python · scikit-learn · XGBoost · SHAP · Supervised Learning

Overview

Housing price prediction is a canonical regression problem that tests a practitioner's ability to handle messy real-world data: mixed variable types, multicollinearity, non-linear relationships, skewed distributions, and missing values. This project applies a structured statistical learning pipeline to the Ames Housing dataset (2,930 residential sales in Ames, Iowa), comparing six modeling approaches from simple baselines to state-of-the-art gradient boosting.

Beyond raw prediction accuracy, the project emphasizes the bias-variance tradeoff, proper cross-validation methodology, feature engineering decisions, and model interpretability. The final Gradient Boosted Trees model achieves an RMSE of $16,840 on the held-out test set, and SHAP analysis reveals which features drive individual predictions — making the model useful for both appraisals and market analysis.

Lasso / Ridge / Elastic Net · Random Forest · Gradient Boosting (XGBoost) · Cross-Validation · SHAP Values · Feature Engineering

Data Preparation

The Ames dataset contains 80 features describing nearly every aspect of a home. My preprocessing pipeline addressed several challenges: imputing missing values (distinguishing between true missingness and "not applicable" for features like pool quality), log-transforming the skewed target variable, encoding ordinal quality ratings as numeric scales, and creating interaction features for total living area and overall quality.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Load and split data
df = pd.read_csv("ames_housing.csv")
X = df.drop("SalePrice", axis=1)
y = np.log1p(df["SalePrice"])  # Log-transform target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature engineering (applied identically to the train and test splits)
def add_features(df):
    df = df.copy()
    df["TotalSF"] = df["1stFlrSF"] + df["2ndFlrSF"] + df["TotalBsmtSF"]
    df["QualxSF"] = df["OverallQual"] * df["TotalSF"]
    df["Age"] = df["YrSold"] - df["YearBuilt"]
    df["RemodAge"] = df["YrSold"] - df["YearRemodAdd"]
    return df

X_train = add_features(X_train)
X_test = add_features(X_test)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X_train.shape[1]}")

Model Comparison

I evaluated six models using 10-fold cross-validation on the training set, then computed final test metrics on the held-out 20%. All models were tuned via grid search or randomized search over their respective hyperparameter spaces. RMSE values are reported on the original dollar scale (after inverse log-transforming predictions).

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

models = {
    "OLS Regression": LinearRegression(),
    "Ridge (α=12)": Ridge(alpha=12),
    "Lasso (α=0.0005)": Lasso(alpha=0.0005),
    "Elastic Net": ElasticNet(alpha=0.0005, l1_ratio=0.7),
    "Random Forest": RandomForestRegressor(
        n_estimators=500, max_depth=12, min_samples_leaf=3
    ),
    "XGBoost": XGBRegressor(
        n_estimators=800, max_depth=5, learning_rate=0.05,
        subsample=0.8, colsample_bytree=0.8, reg_alpha=0.1
    )
}

for name, model in models.items():
    # Out-of-fold predictions are on the log scale; back-transform with expm1
    # so the reported RMSE is on the original dollar scale
    log_pred = cross_val_predict(model, X_train_processed, y_train, cv=10)
    rmse = np.sqrt(mean_squared_error(np.expm1(y_train), np.expm1(log_pred)))
    print(f"{name:20s} CV RMSE: ${rmse:,.0f}")

Model                              CV RMSE     Test RMSE   Test R²
OLS Linear Regression              $24,310     $25,040     0.882
Ridge Regression (α=12)            $22,870     $23,190     0.899
Lasso (α=0.0005)                   $21,950     $22,410     0.905
Elastic Net (α=0.0005, l1=0.7)     $22,100     $22,680     0.903
Random Forest                      $18,420     $19,150     0.931
XGBoost ★                          $16,180     $16,840     0.947

Best test RMSE: $16,840 · Best test R²: 0.947 · Features used: 84

Model Interpretability with SHAP

A prediction model is only useful if stakeholders trust and understand it. I used SHAP (SHapley Additive exPlanations) to decompose the XGBoost model's predictions into individual feature contributions. SHAP values are grounded in cooperative game theory and provide consistent, locally accurate attributions.

import shap

# Compute SHAP values for test set
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test_processed)

# Global feature importance (mean |SHAP|)
shap.summary_plot(shap_values, X_test_processed, plot_type="bar",
                  max_display=15, show=False)

# SHAP beeswarm plot — feature value × impact
shap.summary_plot(shap_values, X_test_processed, max_display=15)
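
The summary plots give the global picture; SHAP's local accuracy property also lets a single prediction be decomposed exactly into a base value plus per-feature contributions. A minimal sketch, continuing from the code above and assuming a reasonably recent shap version with the Explanation API (the row index is arbitrary):

# Local explanation for a single house (index 0 is arbitrary)
explanation = explainer(X_test_processed)
shap.plots.waterfall(explanation[0], max_display=12)

# Local accuracy check: base value + contributions = log-scale prediction
log_pred = explanation.base_values[0] + explanation.values[0].sum()
print(f"Reconstructed prediction: ${np.expm1(log_pred):,.0f}")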

Top 10 Features by Mean |SHAP Value|

Feature         Mean |SHAP|
OverallQual     0.182
TotalSF         0.156
GrLivArea       0.119
QualxSF         0.107
GarageCars      0.080
Age             0.072
TotalBsmtSF     0.064
Neighborhood    0.058
FullBath        0.045
Fireplaces      0.039

Overall quality and total square footage dominate, but the interaction feature (QualxSF) also ranks highly — validating the feature engineering decision.

Bias-Variance Analysis

The progression from OLS to XGBoost illustrates the bias-variance tradeoff in action. OLS has the highest bias (it underfits the non-linear relationships) but the lowest variance. Ridge and Lasso trade a little additional bias for lower variance through coefficient shrinkage, and Lasso also performs feature selection, zeroing out 28 of the 84 features. Random Forest sharply reduces bias through non-linear splits, at the cost of somewhat higher variance. XGBoost balances the two: sequential boosting reduces bias while the regularization parameters (max_depth, subsample, reg_alpha) keep variance in check.
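
The sparsity claim is straightforward to verify from a fitted model; a minimal check, assuming the Lasso from the comparison is refit on the processed training matrix from earlier:

from sklearn.linear_model import Lasso

# Count how many coefficients the L1 penalty drove exactly to zero
lasso = Lasso(alpha=0.0005, max_iter=10000).fit(X_train_processed, y_train)
n_zero = int((lasso.coef_ == 0).sum())
print(f"Lasso zeroed {n_zero} of {lasso.coef_.size} coefficients")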

The gap between CV RMSE and test RMSE is small across all models (~$500–$700), indicating that the 10-fold CV estimates are reliable and no model is severely overfitting. XGBoost's small gap ($16,180 CV vs. $16,840 test) confirms that the hyperparameter tuning generalized well.

Key Takeaways

This project demonstrates the full statistical learning workflow: from data cleaning and feature engineering through model selection, hyperparameter tuning, and interpretability. The Ames dataset's complexity (80 features, mixed types, missing data patterns) makes it a realistic test of practical data science skills. The final XGBoost model explains 94.7% of sale price variance with an RMSE of $16,840 — an error of roughly 9.3% on the median home price. SHAP analysis confirmed that the model aligns with domain knowledge (quality and size matter most), building trust in its predictions for downstream use in property valuation and market analysis.