
California Housing Regression with Five Models and Geographic Error Analysis

California Housing (via scikit-learn) · 5 Regressors · 5-Fold CV · Permutation Importance

Abstract

Five regressors (Linear, Ridge, Lasso, Random Forest, Gradient Boosting) on the California Housing dataset, evaluated under 5-fold cross-validation. Tree ensembles dominate linear models by a large margin (R² 0.82 vs 0.60), revealing strong non-linear structure in the feature-to-target mapping. Permutation importance corroborates impurity-based importance and identifies median income, geographic location, and house age as the dominant predictors. A geographic error map shows where the best model still struggles: the Bay Area and the LA basin, where the dataset's 500k cap censors the upper tail of the target distribution.

Dataset

20,640 California block-group observations, eight features:

Feature                Meaning
MedInc                 median income, tens of thousands of USD
HouseAge               median house age in years
AveRooms               average rooms per household
AveBedrms              average bedrooms per household
Population             block-group population
AveOccup               average occupants per household
Latitude / Longitude   block-group centroid

Target is the median house value of the block group, in units of 100k USD, capped at 5.0 (500k USD). The cap is visible as a spike at the right edge of the target histogram.
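As a minimal sketch, the dataset and its cap can be inspected with scikit-learn's loader; the loader and field names below are sklearn's own, and the printed quantities are illustrative checks rather than part of the original pipeline:

```python
# Load the California Housing data and verify the 500k cap on the target.
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target            # y = median house value, 100k USD units

print(X.shape)                           # (20640, 8) — eight features, as above
print(f"capped blocks: {(y >= 5.0).mean():.1%}")   # mass piled at the 5.0 cap
```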

Figure: two-panel view — target distribution histogram with the 500k cap visible at the right, alongside a feature correlation matrix.

Headline Results

Stratified 5-fold cross-validation. RMSE and MAE are in units of 100k USD; MAPE is a percentage; R² is unitless.

Model               CV RMSE         CV MAE   CV R²   CV MAPE
Gradient Boosting   0.489 ± 0.013   0.332    0.820   18.9 %
Random Forest       0.501 ± 0.011   0.326    0.811   18.4 %
Linear              0.728 ± 0.017   0.532    0.601   31.8 %
Ridge               0.728 ± 0.017   0.532    0.601   31.8 %
Lasso               0.730 ± 0.014   0.535    0.599   31.9 %

Tree ensembles cut RMSE by a third and MAPE by roughly 40 % relative to the linear baseline. Ridge and Lasso are indistinguishable from plain Linear on this dataset: with 20,640 observations and only eight, not particularly collinear features, L1 / L2 regularisation has little to offer when the linear hypothesis class is itself the bottleneck.
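The evaluation loop can be sketched as below. Stratified CV on a continuous target requires discretising it first; binning into quantiles is one common way to do that and is an assumption about the original protocol, as are the hyperparameters:

```python
# 5-fold CV RMSE for one regressor, with stratification on a binned target.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

def cv_rmse(model, X, y, n_splits=5, n_bins=10):
    # Discretise y into quantile bins so StratifiedKFold can balance folds.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    y_bins = np.digitize(y, edges)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train, test in skf.split(X, y_bins):
        model.fit(X[train], y[train])
        pred = model.predict(X[test])
        scores.append(mean_squared_error(y[test], pred) ** 0.5)
    return np.mean(scores), np.std(scores)
```

Running this for each of the five models (LinearRegression, Ridge, Lasso, RandomForestRegressor, GradientBoostingRegressor) yields the mean ± std RMSE figures in the table; the other metrics follow the same loop with a different scorer.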

CV Score Distributions

Figure: four-panel boxplots of stratified 5-fold RMSE, MAE, R², and MAPE for the five regressors.

Predicted vs Actual

Linear regression vs Gradient Boosting on the same data. The linear model's predictions cluster around the mean with a clear underprediction at the upper tail; the boosted model resolves the relationship much better but still bumps up against the 500k cap (the ceiling at y = 5.0).

Figure: side-by-side scatter plots of actual vs predicted house value for Linear regression and Gradient Boosting, both with a y = x reference line.

Residual Diagnostics

Residuals (actual minus predicted) for Gradient Boosting. Two structural features stand out. First, the residual cluster at the right of the predicted axis is the 500k-cap effect: capped blocks are all recorded as 5.0, and the model cannot predict above the censoring boundary, so these high-value blocks always show positive residuals. Second, the residual histogram has a slightly heavier right tail than left, indicating modest under-prediction on the high end.
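The numeric side of these diagnostics can be sketched with a small helper (the function name and exact summary statistics are illustrative, not from the original code):

```python
# Summarise held-out residuals: mean, tail asymmetry, and the share of
# under-predictions. Residuals are actual minus predicted, as in the text.
import numpy as np

def residual_summary(y_true, y_pred):
    r = np.asarray(y_true) - np.asarray(y_pred)
    skew = ((r - r.mean()) ** 3).mean() / r.std() ** 3
    return {
        "mean": float(r.mean()),
        "skew": float(skew),                     # > 0: heavier right tail
        "frac_positive": float((r > 0).mean()),  # > 0.5: net under-prediction
    }
```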

Figure: residual scatter plot vs predicted value (left) and residual histogram (right) for the Gradient Boosting model.

Feature Importance: Three Views

Random Forest impurity (left), Gradient Boosting impurity (middle), Random Forest permutation importance (right, eight repeats). All three orderings agree on the top three: median income, latitude, longitude. Permutation importance shrinks the role of low-cardinality features (HouseAge) relative to impurity, consistent with the well-known cardinality bias of impurity-based importance.
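The permutation pass can be sketched as follows; the split, forest hyperparameters, and function name are illustrative assumptions, while `n_repeats=8` matches the figure:

```python
# Permutation importance for a Random Forest on held-out data:
# each feature's column is shuffled n_repeats times and the drop in
# score is averaged, giving a model-agnostic importance with error bars.
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def permutation_ranking(X, y, n_repeats=8, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=seed)
    rf.fit(X_tr, y_tr)
    res = permutation_importance(rf, X_te, y_te, n_repeats=n_repeats,
                                 random_state=seed, n_jobs=-1)
    order = res.importances_mean.argsort()[::-1]   # most important first
    return order, res.importances_mean, res.importances_std
```

The `importances_std` values supply the error bars on the right-hand panel.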

Figure: three-panel feature importance comparison — Random Forest impurity importance, Gradient Boosting impurity importance, and Random Forest permutation importance with error bars.

Geographic Error Map

Left: actual median house value by block group, plotted on California's coastline shape. Right: absolute error of the Gradient Boosting model on the same blocks. The error concentrates in the Bay Area and along the LA / Orange County coast: the regions where median house values frequently exceed the dataset's 500k cap and the model has no signal to fit beyond it.
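A sketch of the two-panel map, assuming 1-D arrays `lat`, `lon`, `y_true`, `y_pred` from the pipeline above; no coastline shapefile is needed, since the block-group points themselves trace California's shape:

```python
# Two-panel geographic map: actual value (left) and absolute error (right),
# drawn as longitude/latitude scatters coloured by the respective quantity.
import numpy as np
import matplotlib.pyplot as plt

def plot_error_map(lat, lon, y_true, y_pred):
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
    panels = [(y_true, "Median house value (100k USD)"),
              (abs_err, "Absolute error (100k USD)")]
    for ax, (colour, title) in zip(axes, panels):
        sc = ax.scatter(lon, lat, c=colour, s=4, cmap="viridis")
        ax.set_xlabel("Longitude")
        ax.set_ylabel("Latitude")
        ax.set_title(title)
        fig.colorbar(sc, ax=ax)
    return fig
```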

Figure: two side-by-side maps of California — median house value coloured by block group on the left, absolute prediction error coloured by block group on the right, with errors concentrated in the Bay Area and LA basin.