Woningwaardering ML
A gradient boosting regression model for predicting Dutch housing valuation points (woningwaardering points) based on property characteristics.
This project is for learning and educational purposes. It is not intended for real-world applications.
Model Details
Model Description
This is a gradient boosting regression model that predicts Dutch housing valuation points (woningwaardering points) based on property characteristics. The model uses XGBoost's XGBRegressor with hyperparameter tuning.
- Developed by: Tomer Gabay
- Model type: XGBoost Regressor
- Language: Python
- License: MIT
Intended Use
Primary Use Cases
- Researching the impact of housing characteristics on woningwaardering points
- Predicting housing valuation points for properties in the Netherlands
- Real estate valuation assistance
- Housing market analysis
Out-of-Scope Use Cases
- Legal or official property valuation (this is a predictive model, not an official assessment)
- Properties outside the Netherlands
- Properties with characteristics significantly different from the training data
- Non-self-contained dwellings
Quick Start
Installation
Using uv (recommended)
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv pip install -e .
Using pip
# Install as package (recommended)
pip install -e .
# Or just install dependencies
pip install -r requirements.txt
Training the Model
- Open and run the train.ipynb notebook
- The notebook includes:
- Data loading from Hugging Face
- Data preprocessing and feature engineering
- Model training with hyperparameter tuning (RandomizedSearchCV with 5-fold cross-validation)
- Model evaluation and metrics
- Feature importance analysis
- Visualizations
- Model saving
Using a Trained Model
Option 1: Using the High-Level API (Recommended)
from woningwaardering_ml import WoningwaarderingPredictor
import pandas as pd
# Initialize predictor (loads model and preprocessing components)
predictor = WoningwaarderingPredictor()
# Prepare your data
new_property = pd.DataFrame({
    'year_built': [1990],
    'number_of_rooms': [5],
    'number_of_bedrooms': [3],
    'indoor_area': [120],
    'outdoor_area': [50],
    'energy_label': ['B'],
    'property_value': [350000]
})
# Make prediction
predictions = predictor.predict(new_property)
print(f"Predicted woningwaardering points: {predictions.iloc[0]}")
# Batch prediction
multiple_properties = pd.DataFrame({
    'year_built': [1990, 2010, 1985],
    'number_of_rooms': [5, 4, 6],
    'number_of_bedrooms': [3, 2, 4],
    'indoor_area': [120, 90, 150],
    'outdoor_area': [50, 30, 80],
    'energy_label': ['B', 'A', 'C'],
    'property_value': [350000, 280000, 420000]
})
batch_predictions = predictor.predict(multiple_properties)
print(batch_predictions)
Option 2: Using Lower-Level Functions
from woningwaardering_ml import load_model, load_preprocessing_components, preprocess_data
import pandas as pd
# Load model and preprocessing components
model = load_model('woningwaardering_model.json')
energy_label_encoder, feature_cols = load_preprocessing_components()
# Prepare your data
new_property = pd.DataFrame({
    'year_built': [1990],
    'number_of_rooms': [5],
    'number_of_bedrooms': [3],
    'indoor_area': [120],
    'outdoor_area': [50],
    'energy_label': ['B'],
    'property_value': [350000]
})
# Preprocess data
new_property_processed, _ = preprocess_data(
    new_property,
    energy_label_encoder=energy_label_encoder
)
# Select features in the same order as training
X_new = new_property_processed[feature_cols]
# Make prediction
prediction = model.predict(X_new)
print(f"Predicted woningwaardering points: {prediction[0]}")
Option 3: Using Command Line Interface
# Interactive mode
python main.py
# With JSON input
python main.py --property '{"year_built":1990,"number_of_rooms":5,"number_of_bedrooms":3,"indoor_area":120,"outdoor_area":50,"energy_label":"B","property_value":350000}'
# With input file
python main.py --input properties.json --output predictions.csv
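One way to prepare an input file for the --input option is sketched below. The file layout is an assumption: this README does not specify it, so the sketch guesses that main.py accepts a JSON array of objects with the same keys as the --property example. The helper name write_properties is hypothetical and not part of the package.

```python
import json
from pathlib import Path

# Example records using the same keys as the --property example above.
properties = [
    {"year_built": 1990, "number_of_rooms": 5, "number_of_bedrooms": 3,
     "indoor_area": 120, "outdoor_area": 50, "energy_label": "B",
     "property_value": 350000},
    {"year_built": 2010, "number_of_rooms": 4, "number_of_bedrooms": 2,
     "indoor_area": 90, "outdoor_area": 30, "energy_label": "A",
     "property_value": 280000},
]


def write_properties(records, path="properties.json"):
    """Write property records as a JSON array (assumed input layout)."""
    path = Path(path)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    print(f"Wrote {write_properties(properties)}")
```

If main.py expects a different layout, only the structure passed to json.dumps needs to change.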
Required Input Features
The model requires the following features:
- year_built (int): Year the property was built.
- number_of_bedrooms (int): Number of bedrooms.
- indoor_area (float): Total indoor area in m², including storage space and garage.
- outdoor_area (float): Total outdoor area in m².
- energy_label (str): Most recent valid energy label. One of: G, F, E, D, C, B, A, A+, A++, A+++, or A++++ (ordered worst to best).
- property_value (float): Most recent official property value (WOZ waarde), as issued by the government, in euros.
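The required features can be checked before calling the model. The sketch below is a minimal illustration; validate_property and its error messages are hypothetical and not part of woningwaardering_ml, and it assumes integers are acceptable wherever a float is expected.

```python
# Hypothetical helper: not part of the woningwaardering_ml package.
ENERGY_LABELS = ["G", "F", "E", "D", "C", "B", "A", "A+", "A++", "A+++", "A++++"]

# Expected Python types per required feature (ints accepted for float fields).
REQUIRED_FEATURES = {
    "year_built": int,
    "number_of_bedrooms": int,
    "indoor_area": (int, float),
    "outdoor_area": (int, float),
    "energy_label": str,
    "property_value": (int, float),
}


def validate_property(record: dict) -> list[str]:
    """Return a list of problems found in one property record (empty if valid)."""
    problems = []
    for name, expected in REQUIRED_FEATURES.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
        elif isinstance(record[name], bool) or not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    label = record.get("energy_label")
    if isinstance(label, str) and label not in ENERGY_LABELS:
        problems.append(f"unknown energy_label: {label}")
    return problems


example = {
    "year_built": 1990,
    "number_of_bedrooms": 3,
    "indoor_area": 120,
    "outdoor_area": 50,
    "energy_label": "B",
    "property_value": 350000,
}
print(validate_property(example))  # prints []
```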
Project Structure
woningwaardering-ml/
├── woningwaardering_ml/       # Python package
│   ├── __init__.py            # Package initialization
│   ├── preprocessing.py       # Data preprocessing utilities
│   ├── model.py               # Model loading utilities
│   └── inference.py           # High-level inference API
├── train.ipynb                # Main training and evaluation notebook
├── main.py                    # CLI entry point for inference
├── push_to_hub.py             # Script to push model to Hugging Face Hub
├── pyproject.toml             # Project dependencies (uv)
├── requirements.txt           # Alternative requirements file (pip)
├── PUBLISHING_GUIDE.md        # Guide for publishing to Hugging Face
├── README.md                  # This file
├── LICENSE                    # MIT License
├── .gitignore                 # Git ignore rules
│
└── Generated files (after training):
    ├── woningwaardering_model.json   # Trained model
    ├── energy_label_encoder.pkl      # Energy label encoder
    ├── feature_columns.json          # Feature column names (order matters)
    └── model_metadata.json           # Model metadata and hyperparameters
Training Data
Dataset
- Name: woonstadrotterdam/woningwaarderingen-collab
- Source: Hugging Face Datasets
- Splits: Train (80%), Validation (10%), Test (10%)
- Total samples: 27,496
- Training samples: 22,750 (train and validation sets combined, since cross-validation is used during tuning)
For detailed dataset information, see the dataset card.
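The 80/10/10 split and the merge of train and validation can be illustrated as follows. This is a sketch on index positions only; the dataset's actual split procedure is not described here, and split_indices is a hypothetical helper.

```python
import random


def split_indices(n, seed=42, train=0.8, val=0.1):
    """Shuffle range(n) reproducibly and cut it into train/val/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train)
    n_val = int(n * val)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


train_idx, val_idx, test_idx = split_indices(1000)

# Train and validation are merged because cross-validation creates its own
# validation folds during hyperparameter tuning.
train_val_idx = train_idx + val_idx
print(len(train_idx), len(val_idx), len(test_idx), len(train_val_idx))
# prints 800 100 100 900
```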
Data Preprocessing
- Removed description, single_family_home, number_of_rooms, organization columns
- Ordinal encoding for energy_label (G to A++++) → energy_label_encoded
- Final model uses 6 features:
- year_built
- number_of_bedrooms
- indoor_area
- outdoor_area
- property_value
- energy_label_encoded
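Ordinal encoding maps each energy label to an integer that preserves the worst-to-best order. The sketch below assumes codes start at 0 for G; the saved energy_label_encoder.pkl may use different integers, and encode_energy_label is a hypothetical name.

```python
# Label order from the feature description (worst to best).
ENERGY_LABEL_ORDER = ["G", "F", "E", "D", "C", "B", "A", "A+", "A++", "A+++", "A++++"]

# Assumption: codes start at 0 for "G". The project's saved encoder may differ.
ENERGY_LABEL_CODES = {label: code for code, label in enumerate(ENERGY_LABEL_ORDER)}


def encode_energy_label(label: str) -> int:
    """Map an energy label to its ordinal code; raise on unknown labels."""
    try:
        return ENERGY_LABEL_CODES[label]
    except KeyError:
        raise ValueError(f"unknown energy label: {label!r}") from None


print([encode_energy_label(x) for x in ["G", "B", "A++++"]])  # prints [0, 5, 10]
```

Because the codes are ordered, a tree-based model can split on thresholds such as "label better than C" with a single comparison.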
Performance
Metrics
- Test MAE: 2.274 points
- Test MAPE: 1.348%
Evaluation Results
The model was evaluated on a held-out test set of 2,750 samples.
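For reference, the two reported metrics can be computed as in the sketch below. The numbers are made up for illustration and are not taken from the test set; the notebook may compute the metrics with a library instead.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted points."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error relative to the actual value, in percent."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only.
actual = [100, 200, 150]
predicted = [102, 197, 150]

print(f"MAE:  {mean_absolute_error(actual, predicted):.3f} points")       # MAE:  1.667 points
print(f"MAPE: {mean_absolute_percentage_error(actual, predicted):.3f}%")  # MAPE: 1.167%
```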
Model Architecture
Hyperparameters
- Algorithm: XGBRegressor
- Hyperparameter Tuning: RandomizedSearchCV with 5-fold cross-validation
- Reproducibility: Random seed = 42 (set for NumPy, Python random, and all models including XGBoost)
- Features: 6 features (year_built, number_of_bedrooms, indoor_area, outdoor_area, property_value, energy_label_encoded)
- Best parameters found:
- n_estimators: 2000
- max_depth: 9
- learning_rate: 0.03
- subsample: 0.8
- colsample_bytree: 0.9
- min_child_weight: 1
- reg_alpha: 0
- reg_lambda: 1
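Randomized search evaluates a fixed number of randomly drawn configurations rather than every combination. The sketch below shows only the sampling step in plain Python. The search space is hypothetical (the one used in train.ipynb is not listed here), and unlike scikit-learn's RandomizedSearchCV it may draw the same configuration twice.

```python
import random

# Hypothetical search space: only the best values reported above are known;
# the other candidate values are invented for illustration.
SEARCH_SPACE = {
    "n_estimators": [500, 1000, 2000],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.03, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.9, 1.0],
    "min_child_weight": [1, 5],
    "reg_alpha": [0, 0.1],
    "reg_lambda": [1, 10],
}


def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter random configurations from the search space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


candidates = sample_candidates(SEARCH_SPACE, n_iter=5)
for params in candidates:
    print(params)
```

In the real workflow each candidate would be scored with 5-fold cross-validation, and the best-scoring configuration kept.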
Changelog
Version 1.1.0 (Jan 2026)
- Model upgrade: Migrated from scikit-learn's GradientBoostingRegressor to XGBoost's XGBRegressor
- Performance improvements:
- Test MAE reduced from 2.97 to 2.274 points
- Test MAPE reduced from 1.91% to 1.348%
- Training optimization: Switched from GridSearchCV to RandomizedSearchCV for faster hyperparameter tuning
- Dataset update:
- Updated to use woonstadrotterdam/woningwaarderingen-collab dataset, which is a larger and more diverse dataset than the original dataset.
- Combined train and validation sets since we use cross-validation during hyperparameter tuning.
- Dependencies: Added xgboost>=2.0.0 requirement
Version 1.0.0 (Oct 2025)
- Initial release with GradientBoostingRegressor
- Test MAE: 2.97 points
- Test MAPE: 1.91%
- Based on https://huggingface.co/datasets/woonstadrotterdam/woningwaarderingen
Development
Requirements
- Python >= 3.13
- See pyproject.toml or requirements.txt for dependencies
Reproducibility
The model uses a fixed random seed (42) for reproducibility. This is set for:
- NumPy random number generation
- Python's random module
- All models including XGBoost
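A minimal seeding sketch is shown below. The set_seeds helper is hypothetical and the notebook may organize this differently. NumPy is seeded only if it is installed, and estimators still need the seed passed explicitly, for example through their random_state parameter.

```python
import random

SEED = 42


def set_seeds(seed=SEED):
    """Seed Python's random module and, when available, NumPy's global RNG."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy not installed; nothing more to seed here.
    np.random.seed(seed)


set_seeds()
first = random.random()
set_seeds()
second = random.random()
print(first == second)  # prints True
```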
Running Tests
Currently, training and evaluation are done in the train.ipynb notebook. The notebook includes:
- Train/validation/test split evaluation
- Cross-validation during hyperparameter tuning
- Feature importance analysis
To publish the model to the Hugging Face Hub, run push_to_hub.py, or use the notebook cell to push directly.
Citation
If you use this model, please cite:
@software{woningwaardering_ml,
title = {Woningwaardering ML Model},
author = {Gabay, Tomer},
organization = {Woonstad Rotterdam},
year = {2025},
url = {https://huggingface.co/woonstadrotterdam/woningwaardering-ml}
}
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Contact
- Email: [email protected]
- Organization: Woonstad Rotterdam
Limitations
- Geographic Scope: Trained on Rotterdam and Brabant data; may not generalize to other regions
- Temporal Limitations:
- Based on historical data; may not reflect future regulatory changes
- Dwelling Type: Only trained on self-contained dwellings; will not work for non-self-contained dwellings
- Data Quality: Performance depends on quality and completeness of input features
- Dataset Bias: The dataset is biased towards social housing dwellings and may have less representation of higher-end dwellings
For more information about the dataset, see the dataset card.