YSI Predictor: Yield Sooting Index Model
Overview
This repository contains a machine learning model for predicting the Yield Sooting Index (YSI) of single-component fuel molecules directly from their SMILES representation.
YSI measures a fuel's tendency to form soot during combustion.
- Lower YSI → cleaner combustion
- Highly relevant for diesel replacement fuels, bio-fuels, and oxygenated fuels.
This model supports:
- molecular design and optimization,
- genetic algorithms (e.g., CREM),
- Pareto optimization (CN vs YSI),
- rapid candidate screening.
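As an illustration of the Pareto use case, the sketch below filters a candidate list down to its non-dominated set. It assumes CN denotes cetane number and should be maximized while YSI is minimized; the function name `pareto_front`, the dictionary keys, and all numbers are hypothetical and are not outputs of this model.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Assumption: higher "cn" is better, lower "ysi" is better.
    """
    def dominates(a, b):
        # a dominates b if it is at least as good on both objectives
        # and strictly better on at least one of them.
        return (a["cn"] >= b["cn"] and a["ysi"] <= b["ysi"]
                and (a["cn"] > b["cn"] or a["ysi"] < b["ysi"]))

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up values, for illustration only.
candidates = [
    {"name": "A", "cn": 60.0, "ysi": 40.0},
    {"name": "B", "cn": 50.0, "ysi": 20.0},
    {"name": "C", "cn": 45.0, "ysi": 35.0},  # dominated by B
    {"name": "D", "cn": 70.0, "ysi": 60.0},
]
front = pareto_front(candidates)
print([c["name"] for c in front])
```

The pairwise check is quadratic in the number of candidates, which is usually acceptable for screening-sized lists; larger populations would call for a sort-based filter.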
How It Works
The prediction pipeline uses:
- RDKit → molecule parsing
- Mordred → 2D/3D molecular descriptors
- FeatureSelector → dimensionality reduction
- Tree-based regression model trained on experimental YSI values
Prediction flow:
- Input SMILES → RDKit Molecule
- Mordred descriptors generated
- Feature selection applied
- YSI predicted using trained regressor
Two model artifacts are included:

    model.joblib      # trained regressor
    selector.joblib   # feature selector used during training

Training Data
The model was trained on a curated dataset of experimentally measured YSI values covering a diverse set of fuel molecule structures, including:
- linear alkanes
- branched alkanes
- cyclic hydrocarbons
- aromatics
- oxygenated species (ethers, esters)
YSI range in dataset: approximately 3 to 80
Performance
Performance was evaluated on both training and held-out test sets.
Training Performance
| Metric | Score |
|---|---|
| RMSE | 6.9661 |
| MAE | 4.0581 |
| R² | 0.9309 |
Test Performance
| Metric | Score |
|---|---|
| RMSE | 5.9667 |
| MAE | 3.8324 |
| R² | 0.9440 |
| MAPE | 18.38% |
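A test R² of 0.9440 means the model explains roughly 94% of the variance in held-out YSI values. The 18.38% MAPE is comparatively high, which likely reflects low-YSI molecules, where an error of a few units is a large fraction of the true value.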
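For reference, the four reported metrics can be computed from paired true and predicted values as sketched below. This is a minimal standard-library version, not the evaluation script used for the tables above; the function name `regression_metrics` and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R² and MAPE (in percent) for paired values.

    Note: R² is undefined if all y_true are equal, and MAPE is
    undefined if any y_true is zero; neither case is handled here.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Made-up values, for illustration only (not from the YSI dataset).
y_true = [10.0, 20.0, 40.0]
y_pred = [12.0, 18.0, 40.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

In this toy example the same absolute error of 2 units contributes 20% relative error at a true value of 10 but only 10% at a true value of 20, which is why MAPE penalizes errors on low-YSI molecules more heavily.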
Generalization Check
| Metric | Value |
|---|---|
| Train RMSE | 6.9661 |
| Test RMSE | 5.9667 |
| Δ (Test − Train) | −0.9994 |
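The negative Δ means the test RMSE is about one YSI unit lower than the training RMSE, so this split shows no sign of overfitting. The slightly better test performance may reflect a more stable distribution of YSI values in the test set.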
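The gap in the table is a simple difference of the two RMSE values. The short sketch below reproduces it from the reported numbers and also expresses it relative to the training RMSE; the helper name `generalization_gap` is hypothetical.

```python
def generalization_gap(train_rmse, test_rmse):
    """Return (absolute gap, gap relative to the training RMSE)."""
    gap = test_rmse - train_rmse
    return gap, gap / train_rmse


# RMSE values taken from the tables above.
gap, relative_gap = generalization_gap(train_rmse=6.9661, test_rmse=5.9667)
print(f"gap = {gap:.4f}, relative = {relative_gap:.1%}")
# prints: gap = -0.9994, relative = -14.3%
```

A large positive gap would be the usual warning sign of overfitting; a negative gap on a single split is best confirmed with cross-validation before drawing stronger conclusions.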
Usage
Below is a minimal example showing how to use the model in Python.
The feature calculation must match the training pipeline.
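
    import joblib
    from rdkit import Chem

    # FeatureSelector is imported so the class of the pickled selector
    # is available when selector.joblib is loaded
    from shared_features import featurize_df, FeatureSelector

    # Load the trained regressor and the feature selector fitted during training
    model = joblib.load("model.joblib")
    selector = joblib.load("selector.joblib")


    def predict_ysi(smiles: str) -> float:
        """Predict the YSI of a single molecule given as a SMILES string."""
        # Reject strings RDKit cannot parse before computing descriptors
        if Chem.MolFromSmiles(smiles) is None:
            raise ValueError(f"Invalid SMILES: {smiles!r}")
        df = featurize_df([smiles])   # descriptors, computed as in training
        X = selector.transform(df)    # keep only the selected features
        y = model.predict(X)
        return float(y[0])


    print(predict_ysi("CCCCCCC"))  # n-heptane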
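For rapid candidate screening, a single-molecule predictor can be wrapped in a small ranking helper. The sketch below is one possible shape, not part of this repository: `screen_candidates` and `fake_predict` are hypothetical names, the predictor is passed in as a function, and it is assumed to raise `ValueError` for inputs it cannot handle (adjust this to your predictor's actual behaviour). A stand-in predictor with made-up numbers is used so the example runs without the model files.

```python
def screen_candidates(smiles_list, predict_fn, max_ysi=None):
    """Rank SMILES by predicted YSI, lowest first.

    Returns (ranked, failed): ranked is a list of (smiles, ysi) pairs
    at or below max_ysi (if given); failed lists inputs the predictor
    rejected with ValueError.
    """
    ranked, failed = [], []
    for smiles in smiles_list:
        try:
            ysi = predict_fn(smiles)
        except ValueError:
            failed.append(smiles)
            continue
        if max_ysi is None or ysi <= max_ysi:
            ranked.append((smiles, ysi))
    ranked.sort(key=lambda pair: pair[1])
    return ranked, failed


def fake_predict(smiles):
    """Stand-in predictor with made-up values, NOT real model outputs."""
    table = {"CCCCCCC": 30.0, "c1ccccc1": 90.0, "CCO": 10.0}
    if smiles not in table:
        raise ValueError(f"cannot featurize {smiles!r}")
    return table[smiles]


ranked, failed = screen_candidates(
    ["CCCCCCC", "c1ccccc1", "CCO", "not-a-smiles"],
    fake_predict,
    max_ysi=50.0,
)
print(ranked)
print(failed)
```

In practice, `fake_predict` would be replaced by the real prediction function. For large candidate sets it is likely faster to featurize all molecules in one call rather than one at a time.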