YSI Predictor: Yield Sooting Index Model

📌 Overview

This repository contains a machine learning model for predicting the Yield Sooting Index (YSI) of single-component fuel molecules directly from their SMILES representation.

YSI is a soot formation metric used in combustion science.

  • Lower YSI → cleaner combustion
  • Highly relevant for diesel replacement fuels, bio-fuels, and oxygenated fuels.

This model supports:

  • molecular design and optimization,
  • genetic algorithms (e.g., CREM),
  • Pareto optimization (CN vs YSI),
  • rapid candidate screening.
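
As an illustration of the Pareto use case, the sketch below filters a candidate list down to its non-dominated set. It assumes CN denotes cetane number (higher is better) and that YSI should be minimized. The `Candidate` type, the function names, and the numeric values are hypothetical placeholders, not part of this repository and not model output.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    cn: float   # assumed: cetane number, higher is better
    ysi: float  # yield sooting index, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.cn >= b.cn and a.ysi <= b.ysi
    strictly_better = a.cn > b.cn or a.ysi < b.ysi
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Placeholder values for illustration only
pool = [
    Candidate("cand_A", cn=60.0, ysi=20.0),
    Candidate("cand_B", cn=55.0, ysi=10.0),
    Candidate("cand_C", cn=50.0, ysi=25.0),  # dominated by cand_A and cand_B
    Candidate("cand_D", cn=40.0, ysi=5.0),
]
front = pareto_front(pool)
print([c.name for c in front])
```

This pairwise check is quadratic in the number of candidates, which is usually acceptable for screening pools of a few thousand molecules.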

🧠 How It Works

The prediction pipeline uses:

  • RDKit: molecule parsing
  • Mordred: 2D/3D molecular descriptors
  • FeatureSelector: dimensionality reduction
  • Tree-based regression model trained on experimental YSI values

Prediction flow:

  1. Input SMILES → RDKit Molecule
  2. Mordred descriptors generated
  3. Feature selection applied
  4. YSI predicted using trained regressor
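
The four steps above can be sketched as follows. The repository's own featurization code is not reproduced in this README, so `mordred_features` is only a guess at what such a function might look like. It uses 2D descriptors only, since 3D descriptors would need conformers, and both function names are hypothetical. The `selector` and `model` arguments are assumed to expose `transform` and `predict`, respectively.

```python
from typing import Callable, List


def mordred_features(smiles_list: List[str]):
    """Hypothetical featurizer: SMILES strings -> Mordred descriptor table."""
    # Imported lazily so this module loads even without RDKit/Mordred installed.
    import pandas as pd
    from mordred import Calculator, descriptors
    from rdkit import Chem

    mols = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # step 1: returns None for invalid SMILES
        if mol is None:
            raise ValueError(f"Invalid SMILES: {smi!r}")
        mols.append(mol)

    # Step 2: compute descriptors (2D only in this sketch).
    calc = Calculator(descriptors, ignore_3D=True)
    table = calc.pandas(mols, quiet=True)
    # Mordred stores error objects for failed descriptors; coerce them to NaN.
    return table.apply(pd.to_numeric, errors="coerce")


def predict_ysi_pipeline(smiles: str, featurize: Callable, selector, model) -> float:
    """Wire the steps together: featurize -> select -> regress."""
    features = featurize([smiles])           # steps 1-2
    selected = selector.transform(features)  # step 3
    return float(model.predict(selected)[0]) # step 4
```

Passing the featurizer in as an argument keeps the wiring testable with stand-in objects. For real predictions the featurizer has to be the same one used during training.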

Two model artifacts are included:

model.joblib      # trained regressor
selector.joblib   # feature selector used during training


🧬 Training Data

The model was trained on a curated dataset of experimentally measured YSI values covering a diverse set of fuel molecule structures, including:

  • linear alkanes
  • branched alkanes
  • cyclic hydrocarbons
  • aromatics
  • oxygenated species (ethers, esters)

YSI range in dataset: ≈ 3 to 80


📊 Performance

Performance was evaluated on both training and held-out test sets.

⭐ Training Performance

Metric   Score
RMSE     6.9661
MAE      4.0581
R²       0.9309

🧭 Test Performance

Metric   Score
RMSE     5.9667
MAE      3.8324
R²       0.9440
MAPE     18.38%

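The test R² of 0.9440 means the model explains roughly 94% of the variance in held-out YSI values. The MAPE of 18.38% is comparatively high next to the other metrics, likely because small absolute errors on low-YSI molecules translate into large percentage errors.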
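
For reference, the four reported metrics can be computed from true and predicted values as in the sketch below. It uses the standard textbook definitions and is not the project's evaluation script. The function name and the toy values are illustrative only.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Return RMSE, MAE, R^2 and MAPE (in percent) for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)                # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true) # total sum of squares

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
        # MAPE divides by the true value, so it is undefined if any true value is 0
        "mape": 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }


# Toy values, not taken from the YSI dataset
metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In the toy data, an error of 2 units on a true value of 10 contributes 20% to the MAPE, while the same error on a true value of 20 contributes only 10%. This is why MAPE penalizes errors on low-YSI molecules more heavily.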


📉 Generalization Check

Metric             Value
Train RMSE         6.9661
Test RMSE          5.9667
Δ (Test − Train)   −0.9994

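➡️ The negative Δ (test RMSE lower than train RMSE) gives no indication of overfitting. The lower test error is attributed to a more stable target distribution in the test split, so it should not be read as the model being more accurate on unseen molecules in general.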
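
The gap can be reproduced from the two RMSE values reported above. The helper name is hypothetical, and the relative gap is an extra convenience figure that is not reported elsewhere in this README.

```python
def generalization_gap(train_rmse: float, test_rmse: float) -> float:
    """Test RMSE minus train RMSE; a clearly positive value would suggest overfitting."""
    return test_rmse - train_rmse


train_rmse, test_rmse = 6.9661, 5.9667  # values from the table above
gap = generalization_gap(train_rmse, test_rmse)
relative_gap = gap / train_rmse         # gap as a fraction of the train RMSE

print(f"gap = {gap:.4f}")               # gap = -0.9994
print(f"relative gap = {relative_gap:.1%}")
```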


🚀 Usage

Below is a minimal example showing how to use the model in Python.

The feature calculation must match the training pipeline.

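import joblib
from rdkit import Chem
# FeatureSelector is imported so that joblib can (presumably) unpickle the selector
from shared_features import featurize_df, FeatureSelector

# Load the trained regressor and the feature selector fitted during training
model = joblib.load("model.joblib")
selector = joblib.load("selector.joblib")

def predict_ysi(smiles: str) -> float:
    """Predict the YSI of a single molecule from its SMILES string."""
    # MolFromSmiles returns None for SMILES it cannot parse
    if Chem.MolFromSmiles(smiles) is None:
        raise ValueError(f"Invalid SMILES: {smiles!r}")
    df = featurize_df([smiles])   # descriptors, computed as in training
    X = selector.transform(df)    # keep only the selected features
    y = model.predict(X)
    return float(y[0])

print(predict_ysi("CCCCCCC"))  # n-heptane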
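
For rapid candidate screening, a predictor such as `predict_ysi` can be wrapped in a small ranking helper. The sketch below is an assumption about how that might look, not code from this repository. `screen_candidates` is a hypothetical name, and `predict` can be any callable that maps a SMILES string to a float and raises `ValueError` for inputs it cannot handle.

```python
from typing import Callable, Iterable, List, Tuple


def screen_candidates(
    smiles_list: Iterable[str],
    predict: Callable[[str], float],
    max_ysi: float,
) -> List[Tuple[str, float]]:
    """Return (smiles, predicted YSI) pairs at or below max_ysi, lowest YSI first."""
    kept = []
    for smi in smiles_list:
        try:
            ysi = predict(smi)
        except ValueError:
            continue  # skip inputs the predictor rejects
        if ysi <= max_ysi:
            kept.append((smi, ysi))
    return sorted(kept, key=lambda pair: pair[1])


# Example call, assuming predict_ysi from the usage example is available:
# ranked = screen_candidates(["CCCCCCC", "CCO"], predict_ysi, max_ysi=20.0)
```

Because this helper calls the predictor once per molecule, large pools may be faster to process by featurizing all SMILES in a single batch instead.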