Woningwaardering ML

A gradient boosting regression model for predicting Dutch housing valuation points (woningwaardering points) based on property characteristics.

This project is for learning and educational purposes only. It is not intended for real-world applications.

Model Details

Model Description

This is a gradient boosting regression model that predicts Dutch housing valuation points (woningwaardering points) based on property characteristics. The model uses XGBoost's XGBRegressor with hyperparameter tuning.

Developed by: Tomer Gabay
Model type: XGBoost Regressor
Language: Python
License: MIT

Intended Use

Primary Use Cases

Researching the impact of housing characteristics on woningwaardering points
Predicting housing valuation points for properties in the Netherlands
Real estate valuation assistance
Housing market analysis

Out-of-Scope Use Cases

Legal or official property valuation (this is a predictive model, not an official assessment)
Properties outside the Netherlands
Properties with characteristics significantly different from the training data
Non-self-contained dwellings

Quick Start

Installation

Using uv (recommended)

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv pip install -e .

Using pip

# Install as package (recommended)
pip install -e .

# Or just install dependencies
pip install -r requirements.txt

Training the Model

Open and run the train.ipynb notebook.
The notebook includes:
- Data loading from Hugging Face
- Data preprocessing and feature engineering
- Model training with hyperparameter tuning (RandomizedSearchCV with 5-fold cross-validation)
- Model evaluation and metrics
- Feature importance analysis
- Visualizations
- Model saving
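
As a rough illustration of the tuning step listed above, the sketch below shows the idea behind randomized search with 5-fold cross-validation using only the standard library. It is not the notebook's code, which uses scikit-learn's RandomizedSearchCV. The search space values and the helper names `sample_candidates` and `k_fold_indices` are assumptions made for this example.

```python
import random

# Hypothetical search space; the notebook's actual ranges are not listed in this README.
SEARCH_SPACE = {
    "n_estimators": [500, 1000, 2000],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.03, 0.1],
    "subsample": [0.8, 0.9, 1.0],
}


def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter random parameter combinations instead of trying the full grid."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_indices, validation_indices) pairs for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


candidates = sample_candidates(SEARCH_SPACE, n_iter=10)
splits = list(k_fold_indices(n_samples=100, k=5))

# Each candidate would be fitted and scored once per fold.
total_fits = len(candidates) * len(splits)
```

Sampling a fixed number of candidates keeps the number of fits independent of the size of the search space, which is why a randomized search is usually faster than an exhaustive grid.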

Using a Trained Model

Option 1: Using the High-Level API (Recommended)

from woningwaardering_ml import WoningwaarderingPredictor
import pandas as pd

# Initialize predictor (loads model and preprocessing components)
predictor = WoningwaarderingPredictor()

# Prepare your data
new_property = pd.DataFrame({
    'year_built': [1990],
    'number_of_rooms': [5],
    'number_of_bedrooms': [3],
    'indoor_area': [120],
    'outdoor_area': [50],
    'energy_label': ['B'],
    'property_value': [350000]
})

# Make prediction
predictions = predictor.predict(new_property)
print(f"Predicted woningwaardering points: {predictions.iloc[0]}")

# Batch prediction
multiple_properties = pd.DataFrame({
    'year_built': [1990, 2010, 1985],
    'number_of_rooms': [5, 4, 6],
    'number_of_bedrooms': [3, 2, 4],
    'indoor_area': [120, 90, 150],
    'outdoor_area': [50, 30, 80],
    'energy_label': ['B', 'A', 'C'],
    'property_value': [350000, 280000, 420000]
})

batch_predictions = predictor.predict(multiple_properties)
print(batch_predictions)

Option 2: Using Lower-Level Functions

from woningwaardering_ml import load_model, load_preprocessing_components, preprocess_data
import pandas as pd

# Load model and preprocessing components
model = load_model('woningwaardering_model.json')
energy_label_encoder, feature_cols = load_preprocessing_components()

# Prepare your data
new_property = pd.DataFrame({
    'year_built': [1990],
    'number_of_rooms': [5],
    'number_of_bedrooms': [3],
    'indoor_area': [120],
    'outdoor_area': [50],
    'energy_label': ['B'],
    'property_value': [350000]
})

# Preprocess data
new_property_processed, _ = preprocess_data(
    new_property, 
    energy_label_encoder=energy_label_encoder
)

# Select features in the same order as training
X_new = new_property_processed[feature_cols]

# Make prediction
prediction = model.predict(X_new)
print(f"Predicted woningwaardering points: {prediction[0]}")

Option 3: Using Command Line Interface

# Interactive mode
python main.py

# With JSON input
python main.py --property '{"year_built":1990,"number_of_rooms":5,"number_of_bedrooms":3,"indoor_area":120,"outdoor_area":50,"energy_label":"B","property_value":350000}'

# With input file
python main.py --input properties.json --output predictions.csv

Required Input Features

The model requires the following features:

year_built (int): Year the property was built.
number_of_bedrooms (int): Number of bedrooms.
indoor_area (float): Total indoor area in m², including storage space and garage.
outdoor_area (float): Total outdoor area in m².
energy_label (str): Most recent valid energy label. One of: G, F, E, D, C, B, A, A+, A++, A+++, or A++++ (ordered worst to best).
property_value (float): Most recent official property value (WOZ waarde), as issued by the government, in euros.
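
Before calling the model, it can help to check a record against the feature list above. The following is a minimal sketch using only the standard library. The function `validate_property` is hypothetical and not part of the woningwaardering_ml package, and accepting integers for the float features is an assumption.

```python
# Expected Python types per feature, taken from the list above.
REQUIRED_FEATURES = {
    "year_built": int,
    "number_of_bedrooms": int,
    "indoor_area": (int, float),
    "outdoor_area": (int, float),
    "energy_label": str,
    "property_value": (int, float),
}

# Valid energy labels, ordered worst to best.
ENERGY_LABELS = ["G", "F", "E", "D", "C", "B", "A", "A+", "A++", "A+++", "A++++"]


def validate_property(record):
    """Return a list of problems found in one property record (empty if valid)."""
    problems = []
    for name, expected_type in REQUIRED_FEATURES.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    label = record.get("energy_label")
    if isinstance(label, str) and label not in ENERGY_LABELS:
        problems.append(f"unknown energy_label: {label}")
    return problems


example = {
    "year_built": 1990,
    "number_of_bedrooms": 3,
    "indoor_area": 120,
    "outdoor_area": 50,
    "energy_label": "B",
    "property_value": 350000,
}
problems = validate_property(example)  # empty list for a valid record
```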

Project Structure

woningwaardering-ml/
├── woningwaardering_ml/        # Python package
│   ├── __init__.py             # Package initialization
│   ├── preprocessing.py        # Data preprocessing utilities
│   ├── model.py                # Model loading utilities
│   └── inference.py            # High-level inference API
├── train.ipynb                 # Main training and evaluation notebook
├── main.py                     # CLI entry point for inference
├── push_to_hub.py              # Script to push model to Hugging Face Hub
├── pyproject.toml              # Project dependencies (uv)
├── requirements.txt            # Alternative requirements file (pip)
├── PUBLISHING_GUIDE.md         # Guide for publishing to Hugging Face
├── README.md                   # This file
├── LICENSE                     # MIT License
├── .gitignore                  # Git ignore rules
│
├── Generated files (after training):
├── woningwaardering_model.json  # Trained model
├── energy_label_encoder.pkl    # Energy label encoder
├── feature_columns.json        # Feature column names (order matters)
└── model_metadata.json         # Model metadata and hyperparameters

Training Data

Dataset

Name: woonstadrotterdam/woningwaarderingen-collab
Source: Hugging Face Datasets
Splits: Train (80%), Validation (10%), Test (10%)
Total samples: 27,496
Training samples: 22,750 (train and validation splits combined; cross-validation is used during hyperparameter tuning)

For detailed dataset information, see the dataset card.

Data Preprocessing

Removed the description, single_family_home, number_of_rooms, and organization columns
Ordinal encoding for energy_label (G to A++++) → energy_label_encoded
Final model uses 6 features:
- year_built
- number_of_bedrooms
- indoor_area
- outdoor_area
- property_value
- energy_label_encoded
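
The ordinal encoding step can be sketched in plain Python as follows. This is an illustration, not the package's preprocess_data implementation: the mapping of G to 0 through A++++ to 10 is an assumption about the encoder's integer values, and the helpers `encode_energy_label` and `to_feature_row` are hypothetical.

```python
# Labels ordered from worst to best.
ENERGY_LABEL_ORDER = ["G", "F", "E", "D", "C", "B", "A", "A+", "A++", "A+++", "A++++"]

# Assumption: the worst label maps to 0 and the best to 10.
ENERGY_LABEL_TO_CODE = {label: code for code, label in enumerate(ENERGY_LABEL_ORDER)}

# Feature order as listed above; the model expects a fixed column order.
FEATURE_ORDER = [
    "year_built",
    "number_of_bedrooms",
    "indoor_area",
    "outdoor_area",
    "property_value",
    "energy_label_encoded",
]


def encode_energy_label(label):
    """Map an energy label to its ordinal code, rejecting unknown labels."""
    try:
        return ENERGY_LABEL_TO_CODE[label]
    except KeyError:
        raise ValueError(f"unknown energy label: {label!r}") from None


def to_feature_row(record):
    """Drop unused columns, add the encoded label, and order the features."""
    enriched = dict(record, energy_label_encoded=encode_energy_label(record["energy_label"]))
    return [enriched[name] for name in FEATURE_ORDER]


row = to_feature_row({
    "year_built": 1990,
    "number_of_rooms": 5,  # ignored: not one of the six model features
    "number_of_bedrooms": 3,
    "indoor_area": 120,
    "outdoor_area": 50,
    "energy_label": "B",
    "property_value": 350000,
})
```

An ordinal code preserves the worst-to-best ordering of the labels, which a tree-based model can split on directly.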

Performance

Metrics

Test MAE: 2.274 points
Test MAPE: 1.348%
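
For reference, the two metrics can be computed as in the sketch below. The values in `actual` and `predicted` are made up for illustration and are not taken from the dataset. MAPE is expressed as a percentage, matching how it is reported above.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted points."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error relative to the actual value, as a percentage."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only.
actual = [100, 200, 150]
predicted = [98, 204, 150]

mae = mean_absolute_error(actual, predicted)              # (2 + 4 + 0) / 3 = 2.0 points
mape = mean_absolute_percentage_error(actual, predicted)  # 100 * (0.02 + 0.02 + 0) / 3
```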

Evaluation Results

The model was evaluated on a held-out test set of 2,750 samples.

Model Architecture

Hyperparameters

Algorithm: XGBRegressor
Hyperparameter Tuning: RandomizedSearchCV with 5-fold cross-validation
Reproducibility: Random seed = 42 (set for NumPy, Python random, and all models including XGBoost)
Features: 6 features (year_built, number_of_bedrooms, indoor_area, outdoor_area, property_value, energy_label_encoded)
Best parameters found:
- n_estimators: 2000
- max_depth: 9
- learning_rate: 0.03
- subsample: 0.8
- colsample_bytree: 0.9
- min_child_weight: 1
- reg_alpha: 0
- reg_lambda: 1
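
The configuration above could be recreated roughly as follows. This is a sketch, not the notebook's code: it builds an unfitted estimator from the reported parameters, and the import is guarded only so that the snippet also runs where xgboost is not installed.

```python
# Best parameters reported above, plus the fixed seed used throughout the project.
BEST_PARAMS = {
    "n_estimators": 2000,
    "max_depth": 9,
    "learning_rate": 0.03,
    "subsample": 0.8,
    "colsample_bytree": 0.9,
    "min_child_weight": 1,
    "reg_alpha": 0,
    "reg_lambda": 1,
    "random_state": 42,
}

try:
    from xgboost import XGBRegressor
except ImportError:
    # xgboost is needed to actually build the model; install it with the project dependencies.
    XGBRegressor = None

# An unfitted estimator with the reported configuration; call .fit(X, y) to train it.
model = XGBRegressor(**BEST_PARAMS) if XGBRegressor is not None else None
```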

Changelog

Version 1.1.0 (Jan 2026)

Model upgrade: Migrated from scikit-learn's GradientBoostingRegressor to XGBoost's XGBRegressor
Performance improvements:
- Test MAE reduced from 2.97 to 2.274 points
- Test MAPE reduced from 1.91% to 1.348%
Training optimization: Switched from GridSearchCV to RandomizedSearchCV for faster hyperparameter tuning
Dataset update:
- Switched to the woonstadrotterdam/woningwaarderingen-collab dataset, which is larger and more diverse than the original dataset.
- Combined train and validation sets since we use cross-validation during hyperparameter tuning.
Dependencies: Added xgboost>=2.0.0 requirement

Version 1.0.0 (Oct 2025)

Initial release with GradientBoostingRegressor
Test MAE: 2.97 points
Test MAPE: 1.91%
Based on https://huggingface.co/datasets/woonstadrotterdam/woningwaarderingen

Development

Requirements

Python >= 3.13
See pyproject.toml or requirements.txt for dependencies

Reproducibility

The model uses a fixed random seed (42) for reproducibility. This is set for:

NumPy random number generation
Python's random module
All models including XGBoost
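
A minimal sketch of how these seeds might be set in one place is shown below. The helper `set_seeds` is hypothetical, and the NumPy import is guarded only so that the snippet runs without NumPy installed. XGBoost estimators take their seed through the `random_state` parameter, so the same constant would be passed there.

```python
import random

SEED = 42


def set_seeds(seed=SEED):
    """Seed Python's random module and, if available, NumPy's global generator."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return
    np.random.seed(seed)


# Re-seeding reproduces the same sequence of random numbers.
set_seeds()
first_draw = random.random()
set_seeds()
second_draw = random.random()
```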

Running Tests

Currently, the main training and evaluation are done in the train.ipynb notebook. The notebook includes:

Train/validation/test split evaluation
Cross-validation during hyperparameter tuning
Feature importance analysis

To publish the model to the Hugging Face Hub, run push_to_hub.py (see PUBLISHING_GUIDE.md) or use the notebook cell to push directly.

Citation

If you use this model, please cite:

@software{woningwaardering_ml,
  title = {Woningwaardering ML Model},
  author = {Gabay, Tomer},
  organization = {Woonstad Rotterdam},
  year = {2025},
  url = {https://huggingface.co/woonstadrotterdam/woningwaardering-ml}
}

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Contact

Email: [email protected]
Organization: Woonstad Rotterdam

Limitations

Geographic Scope: Trained on Rotterdam and Brabant data, so it may not generalize to other regions
Temporal Limitations:
- Based on historical data, so it may not reflect future regulatory changes
Dwelling Type: Only trained on self-contained dwellings; will not work for non-self-contained dwellings
Data Quality: Performance depends on quality and completeness of input features
Dataset Bias: The dataset is biased towards social housing dwellings and may have less representation of higher-end dwellings

For more information about the dataset, see the dataset card.
