Comparing regularized regression, tree-based methods, and ensemble models on the Ames Housing dataset — with rigorous cross-validation and interpretability analysis via SHAP values.
Housing price prediction is a canonical regression problem that tests a practitioner's ability to handle messy real-world data: mixed variable types, multicollinearity, non-linear relationships, skewed distributions, and missing values. This project applies a structured statistical learning pipeline to the Ames Housing dataset (2,930 residential sales in Ames, Iowa), comparing six modeling approaches from simple baselines to state-of-the-art gradient boosting.
Beyond raw prediction accuracy, the project emphasizes the bias-variance tradeoff, proper cross-validation methodology, feature engineering decisions, and model interpretability. The final Gradient Boosted Trees model achieves an RMSE of $16,840 on the held-out test set, and SHAP analysis reveals which features drive individual predictions — making the model useful for both appraisals and market analysis.
The Ames dataset contains 80 features describing nearly every aspect of a home. My preprocessing pipeline addressed several challenges: imputing missing values (distinguishing between true missingness and "not applicable" for features like pool quality), log-transforming the skewed target variable, encoding ordinal quality ratings as numeric scales, and creating interaction features for total living area and overall quality.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Load and split data
df = pd.read_csv("ames_housing.csv")
X = df.drop("SalePrice", axis=1)
y = np.log1p(df["SalePrice"]) # Log-transform target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Feature engineering
X_train["TotalSF"] = (X_train["1stFlrSF"] +
X_train["2ndFlrSF"] +
X_train["TotalBsmtSF"])
X_train["QualxSF"] = X_train["OverallQual"] * X_train["TotalSF"]
X_train["Age"] = X_train["YrSold"] - X_train["YearBuilt"]
X_train["RemodAge"] = X_train["YrSold"] - X_train["YearRemodAdd"]
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X_train.shape[1]}")
I evaluated six models using 10-fold cross-validation on the training set, then computed final test metrics on the held-out 20%. All models were tuned via grid search or randomized search over their respective hyperparameter spaces. RMSE values are reported on the original dollar scale (after inverse log-transforming predictions).
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
models = {
"OLS Regression": LinearRegression(),
"Ridge (α=12)": Ridge(alpha=12),
"Lasso (α=0.0005)": Lasso(alpha=0.0005),
"Elastic Net": ElasticNet(alpha=0.0005, l1_ratio=0.7),
"Random Forest": RandomForestRegressor(
n_estimators=500, max_depth=12, min_samples_leaf=3
),
"XGBoost": XGBRegressor(
n_estimators=800, max_depth=5, learning_rate=0.05,
subsample=0.8, colsample_bytree=0.8, reg_alpha=0.1
)
}
for name, model in models.items():
cv_scores = cross_val_score(
model, X_train_processed, y_train,
cv=10, scoring="neg_root_mean_squared_error"
)
print(f"{name:20s} CV RMSE: ${-cv_scores.mean()*1000:,.0f} (± ${cv_scores.std()*1000:,.0f})")
A prediction model is only useful if stakeholders trust and understand it. I used SHAP (SHapley Additive exPlanations) to decompose the XGBoost model's predictions into individual feature contributions. SHAP values are grounded in cooperative game theory and provide consistent, locally accurate attributions.
import shap
# Compute SHAP values for test set
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test_processed)
# Global feature importance (mean |SHAP|)
shap.summary_plot(shap_values, X_test_processed, plot_type="bar",
max_display=15, show=False)
# SHAP beeswarm plot — feature value × impact
shap.summary_plot(shap_values, X_test_processed, max_display=15)
Overall quality and total square footage dominate, but the interaction feature (QualxSF) also ranks highly — validating the feature engineering decision.
The progression from OLS to XGBoost illustrates the bias-variance tradeoff in action. OLS has the highest bias (underfitting complex non-linear relationships) but lowest variance. Lasso reduces variance further through feature selection — zeroing out 28 of the 84 features. Random Forest dramatically reduces bias through non-linear splits but at slightly higher variance. XGBoost achieves the best of both worlds: the sequential boosting procedure reduces bias while regularization parameters (max_depth, subsample, reg_alpha) control variance.
The gap between CV RMSE and test RMSE is small across all models (~$500–$700), indicating that the 10-fold CV estimates are reliable and no model is severely overfitting. XGBoost's small gap ($16,180 CV vs. $16,840 test) confirms that the hyperparameter tuning generalized well.
This project demonstrates the full statistical learning workflow: from data cleaning and feature engineering through model selection, hyperparameter tuning, and interpretability. The Ames dataset's complexity (80 features, mixed types, missing data patterns) makes it a realistic test of practical data science skills. The final XGBoost model explains 94.7% of sale price variance with an RMSE of $16,840 — an error of roughly 9.3% on the median home price. SHAP analysis confirmed that the model aligns with domain knowledge (quality and size matter most), building trust in its predictions for downstream use in property valuation and market analysis.