Model description

Gradient boosting regressor trained on the California Housing dataset.

The model is a scikit-learn stacking ensemble whose final estimator is a gradient boosting regressor. In addition to the standard features, the final estimator receives the predictions of a k-nearest-neighbors (KNN) model fitted on longitude and latitude only. These predictions are computed out of fold and then appended to the existing features. The extra feature helps decision-tree-based models, which cannot easily learn from geospatial data.
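The out-of-fold idea described above can be sketched by hand with `cross_val_predict`. This is only an illustration under stated assumptions: the data is synthetic, the column `knn_prediction` is a hypothetical name, and the fold count of 5 is chosen for the example rather than taken from the card.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the housing data (invented values, not the real dataset).
rng = np.random.default_rng(0)
n_samples = 300
X = pd.DataFrame({
    "MedInc": rng.uniform(1.0, 10.0, n_samples),
    "Longitude": rng.uniform(-124.0, -114.0, n_samples),
    "Latitude": rng.uniform(32.0, 42.0, n_samples),
})
# A target that depends on location jointly, plus a simple income effect.
y = np.sin(X["Longitude"]) * np.cos(X["Latitude"]) + 0.1 * X["MedInc"]

# Out-of-fold predictions: each row is predicted by a KNN fitted on the
# other folds, so the new feature never uses that row's own target value.
knn = KNeighborsRegressor(n_neighbors=5)
oof_pred = cross_val_predict(knn, X[["Longitude", "Latitude"]], y, cv=5)

# Append the KNN prediction to the original features.
X_augmented = X.assign(knn_prediction=oof_pred)
print(X_augmented.shape)
```

A tree-based model trained on `X_augmented` sees a single column that already summarizes the neighborhood of each point, instead of having to approximate it with many splits on longitude and latitude.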

Intended uses & limitations

This model is meant for demonstration purposes.

Training Procedure

Hyperparameters

The model is trained with the hyperparameters below.

Hyperparameter	Value
cv
estimators	[('knn@5', Pipeline(steps=[('select_cols', ColumnTransformer(transformers=[('long_and_lat', 'passthrough', ['Longitude', 'Latitude'])])), ('knn', KNeighborsRegressor())]))]
final_estimator__alpha	0.9
final_estimator__ccp_alpha	0.0
final_estimator__criterion	friedman_mse
final_estimator__init
final_estimator__learning_rate	0.1
final_estimator__loss	squared_error
final_estimator__max_depth	3
final_estimator__max_features
final_estimator__max_leaf_nodes
final_estimator__min_impurity_decrease	0.0
final_estimator__min_samples_leaf	1
final_estimator__min_samples_split	2
final_estimator__min_weight_fraction_leaf	0.0
final_estimator__n_estimators	500
final_estimator__n_iter_no_change
final_estimator__random_state	0
final_estimator__subsample	1.0
final_estimator__tol	0.0001
final_estimator__validation_fraction	0.1
final_estimator__verbose	0
final_estimator__warm_start	False
final_estimator	GradientBoostingRegressor(n_estimators=500, random_state=0)
n_jobs
passthrough	True
verbose	0
knn@5	Pipeline(steps=[('select_cols', ColumnTransformer(transformers=[('long_and_lat', 'passthrough', ['Longitude', 'Latitude'])])), ('knn', KNeighborsRegressor())])
knn@5__memory
knn@5__steps	[('select_cols', ColumnTransformer(transformers=[('long_and_lat', 'passthrough', ['Longitude', 'Latitude'])])), ('knn', KNeighborsRegressor())]
knn@5__verbose	False
knn@5__select_cols	ColumnTransformer(transformers=[('long_and_lat', 'passthrough', ['Longitude', 'Latitude'])])
knn@5__knn	KNeighborsRegressor()
knn@5__select_cols__n_jobs
knn@5__select_cols__remainder	drop
knn@5__select_cols__sparse_threshold	0.3
knn@5__select_cols__transformer_weights
knn@5__select_cols__transformers	[('long_and_lat', 'passthrough', ['Longitude', 'Latitude'])]
knn@5__select_cols__verbose	False
knn@5__select_cols__verbose_feature_names_out	True
knn@5__select_cols__long_and_lat	passthrough
knn@5__knn__algorithm	auto
knn@5__knn__leaf_size	30
knn@5__knn__metric	minkowski
knn@5__knn__metric_params
knn@5__knn__n_jobs
knn@5__knn__n_neighbors	5
knn@5__knn__p	2
knn@5__knn__weights	uniform

Model Plot

The model plot is below.

StackingRegressor(estimators=[('knn@5',Pipeline(steps=[('select_cols',ColumnTransformer(transformers=[('long_and_lat','passthrough',['Longitude','Latitude'])])),('knn',KNeighborsRegressor())]))],final_estimator=GradientBoostingRegressor(n_estimators=500,random_state=0),passthrough=True)
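The representation above can be turned back into a constructor call. The sketch below rebuilds the estimator from that representation, checks two entries of the hyperparameter table through `get_params`, and fits on synthetic data. The data and the reduction to 20 boosting stages are assumptions made to keep the example fast; they are not part of the original training setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# KNN pipeline that only sees the two location columns.
knn_on_location = Pipeline(steps=[
    ("select_cols", ColumnTransformer(
        transformers=[("long_and_lat", "passthrough", ["Longitude", "Latitude"])]
    )),
    ("knn", KNeighborsRegressor()),
])

# passthrough=True hands the original features to the final estimator
# together with the KNN predictions.
model = StackingRegressor(
    estimators=[("knn@5", knn_on_location)],
    final_estimator=GradientBoostingRegressor(n_estimators=500, random_state=0),
    passthrough=True,
)

# Nested parameter names follow the <component>__<parameter> convention
# used in the hyperparameter table.
params = model.get_params()
print(params["final_estimator__n_estimators"])
print(params["knn@5__knn__n_neighbors"])

# Synthetic stand-in data (invented values, not the real dataset).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "MedInc": rng.uniform(1.0, 10.0, 200),
    "Longitude": rng.uniform(-124.0, -114.0, 200),
    "Latitude": rng.uniform(32.0, 42.0, 200),
})
y = np.sin(X["Longitude"]) * np.cos(X["Latitude"]) + 0.1 * X["MedInc"]

# Fewer boosting stages than the card's 500, only to keep the demo quick.
model.set_params(final_estimator__n_estimators=20)
model.fit(X, y)

# transform() returns what the final estimator receives:
# one KNN prediction column followed by the original features.
stacked = model.transform(X)
print(stacked.shape)
```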

Evaluation Results

Metrics are calculated on the test set.

Metric	Value
Root mean squared error	44273.5
Mean absolute error	30079.9
R²	0.805954
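
As a reference for how these three metrics are defined, the sketch below computes them with `sklearn.metrics` on a handful of invented values. The arrays are hypothetical and the card does not show its evaluation code, so this illustrates the metric definitions only. The magnitude of the reported errors suggests the target was scored in dollars; that is an inference from the numbers, not something the card states.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented house values and predictions, for illustration only.
y_true = np.array([200_000.0, 150_000.0, 300_000.0, 250_000.0])
y_pred = np.array([210_000.0, 140_000.0, 280_000.0, 260_000.0])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# MAE is the mean of the absolute errors.
mae = mean_absolute_error(y_true, y_pred)
# R² compares the squared error with that of always predicting the mean.
r2 = r2_score(y_true, y_pred)

print(f"Root mean squared error: {rmse:.1f}")
print(f"Mean absolute error: {mae:.1f}")
print(f"R²: {r2:.6f}")
```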

Dataset description

California Housing dataset

Data Set Characteristics:

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository. https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the :func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
  Statistics and Probability Letters, 33 (1997) 291-297

How to Get Started with the Model

Run the code below to load the model and predict on the stored example input:

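import json

import pandas as pd
import skops.io as sio

# Load the persisted model from the skops file.
model = sio.load("model.skops")
# Read the example input stored alongside the model.
with open("config.json") as f:
    config = json.load(f)
# Predict on the example rows.
model.predict(pd.DataFrame.from_dict(config["sklearn"]["example_input"]))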
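
To predict on your own rows instead of the stored example, the input needs the same eight feature columns. The sketch below assumes that `example_input` maps each column name to a list of values, which is what `pd.DataFrame.from_dict` expects by default; the numbers are invented for illustration and are not taken from `config.json`.

```python
import pandas as pd

# Hypothetical stand-in for config["sklearn"]["example_input"]:
# one key per feature column, each holding one value per row (invented numbers).
example_input = {
    "MedInc": [8.3, 3.1],
    "HouseAge": [41.0, 20.0],
    "AveRooms": [6.9, 5.2],
    "AveBedrms": [1.0, 1.1],
    "Population": [322.0, 1500.0],
    "AveOccup": [2.5, 3.0],
    "Latitude": [37.88, 34.05],
    "Longitude": [-122.23, -118.24],
}

# The default orient="columns" turns each key into a DataFrame column.
X_example = pd.DataFrame.from_dict(example_input)
print(X_example.shape)
```

A frame built this way can be passed to `model.predict` in place of the stored example, provided the column names match those used in training.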

Ahmadswaid/example-california-housing