C2S-Scale-Gemma-2-27B Age Prediction (Full Fine-tuning)

This model is a fully fine-tuned version of vandijklab/C2S-Scale-Gemma-2-27B for predicting donor age from single-cell RNA-seq data.

Model Details

Training Configuration

  • Optimizer: AdamW (fused)
  • Learning Rate: 8e-6 with cosine schedule
  • Warmup Ratio: 0.1
  • Weight Decay: 0.005
  • Batch Size: 2 per GPU × 8 GPUs = 16 global batch size
  • Gradient Checkpointing: Enabled
  • Flash Attention 2: Enabled
  • DeepSpeed: ZeRO Stage 3

Training Features

The model was trained with Liger optimizations:

  • Liger RoPE
  • Liger RMSNorm
  • Liger GLU Activation
  • Liger Layer Norm
  • Liger Fused Linear Cross Entropy

Evaluation Results

Our Model Performance: Before vs After Fine-tuning

We compared the performance of the base model (before fine-tuning) against our fine-tuned model on the age prediction task:

Model Pearson Correlation MAE (years) RMSE (years)
Base Model (before fine-tuning) -0.09 16.2 20.3
Our Fine-tuned Model 0.20 10.2 12.9
Improvement +0.29 -6.0 -7.4

Key Improvements:

  • Pearson Correlation: Improved from negative correlation (-0.09) to positive correlation (0.20), indicating the model learned meaningful age-related patterns
  • MAE reduction: 37% improvement (from 16.2 to 10.2 years)
  • RMSE reduction: 36% improvement (from 20.3 to 12.9 years)

Comparison with Published Age Clocks

The table below compares our model with other published age prediction models from scAgeClock (Xie et al., 2025):

Model Pearson r MAE (years) RMSE (years) Reference
GMA (scAgeClock) 0.78 10.7 13.5 Xie, 2025
XGBoost 0.77 10.8 13.6 Xie, 2025
MLP 0.75 11.1 13.8 Xie, 2025
CatBoost 0.74 10.9 14.2 Xie, 2025
Elastic Net 0.73 12.0 15.3 Xie, 2025
Our Model (C2S-Gemma) 0.20 10.2 12.9 This work

Important Note: The published models were evaluated on different datasets than ours, making direct comparison not fully appropriate. However, this comparison provides useful context for understanding the general performance landscape of age prediction models:

  • MAE Comparison: Our model achieves competitive MAE (10.2 years), which is the best among all models in absolute error terms
  • RMSE Comparison: Our RMSE (12.9 years) is also the lowest, suggesting good performance on larger prediction errors
  • Pearson Correlation Gap: The significantly lower Pearson correlation (0.20 vs 0.73-0.78) indicates room for improvement in capturing linear age relationships

Key Observations:

  1. Our model's error metrics (MAE and RMSE) are competitive with or better than established age clocks
  2. The lower correlation suggests our model may benefit from:
    • Extended training (as suggested by the non-plateauing loss curve)
    • Larger or more diverse training datasets
    • Architecture or hyperparameter optimization
  3. Despite the lower correlation, the model shows practical utility with prediction errors comparable to state-of-the-art methods

Training Loss Analysis

Training Loss Curve

The training loss curve demonstrates consistent and steady improvement throughout the training process:

  • Loss decreased from approximately 1.35 to 1.05 over 5,991 steps
  • The curve shows a consistent downward trend with a steady angle throughout training
  • No plateau observed: The loss continues to decrease without flattening, even in the final training steps
  • This indicates the model has not yet converged to its optimal performance

Important observation: The absence of a plateau in the training loss suggests that extending training duration would likely yield further improvements in model performance. The current results represent a snapshot of the model's capabilities, with potential for enhanced accuracy through additional training steps.

Prompt Template

The model was trained using the following prompt structure with aging-related genes from the Open Genes database:

The following is a list of {num_genes} gene aging related gene names from Open Genes database ordered by descending expression level in a {organism} cell.

Sex: {sex}
Smoking status: {smoking_status}
Tissue: {tissue}
Cell type: {cell_type}
Aging related cell sentence: {gene_sentence_opengenes}
Predict the Age of the donor from whom these cells were taken.
Answer only with age value in years:

Example Input

The following is a list of 1000 gene aging related gene names from Open Genes database ordered by descending expression level in a Homo sapiens cell.

Sex: female
Smoking status: non-smoker
Tissue: blood
Cell type: CD4+ T cell
Aging related cell sentence: FOXP3 IL2RA CTLA4 TNFRSF18 IKZF2 CCR4 IL32 TIGIT TNFRSF4 LAG3 CD3D CD3E IL7R LTB TRAC CD2 CD52 TRBC2 ...
Predict the Age of the donor from whom these cells were taken.
Answer only with age value in years:

Expected Output

45

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "transhumanist-already-exists/C2S-Scale-Gemma-2-27B-age-prediction-fullft",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/C2S-Scale-Gemma-2-27B-age-prediction-fullft",
    trust_remote_code=True
)

# Example prompt
prompt = """The following is a list of 1000 gene aging related gene names from Open Genes database ordered by descending expression level in a Homo sapiens cell.

Sex: female
Smoking status: non-smoker
Tissue: blood
Cell type: CD4+ T cell
Aging related cell sentence: FOXP3 IL2RA CTLA4 TNFRSF18 IKZF2 CCR4 IL32 TIGIT TNFRSF4 LAG3 CD3D CD3E IL7R LTB TRAC CD2 CD52 TRBC2 ...
Predict the Age of the donor from whom these cells were taken.
Answer only with age value in years:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20, temperature=0.0)
predicted_age = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
print(f"Predicted age: {predicted_age}")

Dataset

The model was trained on single-cell RNA-seq data with aging-related gene signatures from the Open Genes database. The dataset includes:

  • Cell type annotations
  • Donor metadata (age, sex, tissue, smoking status)
  • Gene expression profiles converted to cell sentences using aging-related genes
  • Focus on genes with known roles in aging processes

Key Features

  • Aging-Specific Gene Focus: Uses curated aging-related genes from Open Genes database
  • Cell-Level Predictions: Predicts donor age from individual cell transcriptomes
  • Metadata Integration: Incorporates sex, tissue, cell type, and smoking status
  • Zero-shot Generalization: Can predict ages for unseen cell types and tissues

Intended Use

This model is designed for:

  • Research in cellular aging
  • Age prediction from single-cell transcriptomics
  • Understanding age-related gene expression patterns
  • Biological age estimation
  • Cell aging biomarker discovery

Limitations

  • Predictions are based on single-cell gene expression patterns
  • Performance may vary across different tissues and cell types
  • Requires proper cell sentence formatting with aging-related genes
  • Should be used for research purposes only
  • Trained specifically on Asian PBMC samples, generalization to other populations should be validated

Citation

If you use this model, please cite:

@misc{c2s-gemma2-27b-age,
  title={C2S-Scale-Gemma-2-27B Age Prediction},
  author={Transhumanist Research},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/transhumanist-already-exists/C2S-Scale-Gemma-2-27B-age-prediction-fullft}
}

Also cite the base Cell2Sentence model and Open Genes database.

References

Comparison Models

License

This model inherits the Gemma license from the base model.

Acknowledgments

This model was developed during the Evolved 2025 Hackathon.

We would like to thank the hackathon organizers and our GPU compute providers for making this work possible:

Downloads last month
112
Safetensors
Model size
27B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for transhumanist-already-exists/C2S-Scale-Gemma-2-27B-age-prediction-fullft

Base model

google/gemma-2-27b
Finetuned
(1)
this model

Dataset used to train transhumanist-already-exists/C2S-Scale-Gemma-2-27B-age-prediction-fullft

Collection including transhumanist-already-exists/C2S-Scale-Gemma-2-27B-age-prediction-fullft