Base model: westlake-repl/SaProt_35M_AF2

Task type: protein-level classification

Dataset: This model classifies proteins into the 6 major EC classes (EC1-EC6). EC7 was excluded because only 31 samples were available. To address class imbalance, Label 4 (EC5) was duplicated 2 times and Label 5 (EC6) was duplicated 1 time in the training set. The training data was obtained from: https://academic.oup.com/nar/article/54/D1/D643/8313833
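The duplication step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the helper `oversample`, the toy sequences, and the reading of "duplicated N times" as N extra copies of each sample are all assumptions (the card does not say whether the count includes the original).

```python
from collections import Counter

def oversample(samples, extra_copies):
    """Return a new list where each sample whose label appears in
    `extra_copies` is followed by that many additional copies."""
    out = []
    for sequence, label in samples:
        out.extend([(sequence, label)] * (1 + extra_copies.get(label, 0)))
    return out

# Toy (sequence, label) pairs; hypothetical, not taken from the real dataset.
toy_train = [("MKTAYIAK", 0), ("GAVLIPFW", 4), ("LLSDEQRN", 5)]

# Assumed reading: Label 4 gets 2 extra copies, Label 5 gets 1 extra copy.
EXTRA_COPIES = {4: 2, 5: 1}

balanced = oversample(toy_train, EXTRA_COPIES)
label_counts = Counter(label for _, label in balanced)
print(label_counts)
```

If the intended meaning is total copies rather than extra copies, only the values in `EXTRA_COPIES` would change.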

Label mapping:

  • Label 0: Oxidoreductase (EC1)
  • Label 1: Transferase (EC2)
  • Label 2: Hydrolase (EC3)
  • Label 3: Lyase (EC4)
  • Label 4: Isomerase (EC5)
  • Label 5: Ligase (EC6)
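For downstream use, the label mapping can be kept as a small lookup table. The sketch below is illustrative only: the names `ID2LABEL` and `decode` are hypothetical, and the example scores are made up rather than produced by the model.

```python
# Label index -> enzyme class, copied from the label mapping in this card.
ID2LABEL = {
    0: "Oxidoreductase (EC1)",
    1: "Transferase (EC2)",
    2: "Hydrolase (EC3)",
    3: "Lyase (EC4)",
    4: "Isomerase (EC5)",
    5: "Ligase (EC6)",
}

def decode(scores):
    """Map a list of 6 per-class scores to the name of the top-scoring class."""
    if len(scores) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} scores, got {len(scores)}")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return ID2LABEL[best_index]

# Made-up scores for one protein; index 1 is the largest.
predicted = decode([0.1, 2.3, -0.5, 0.0, 0.2, -1.0])
print(predicted)  # Transferase (EC2)
```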

Training set distribution:

  • Label 0: 1497 (28.5%)
  • Label 2: 1217 (23.2%)
  • Label 1: 1050 (20.0%)
  • Label 3: 512 (9.7%)
  • Label 4: 496 (9.4%)
  • Label 5: 483 (9.2%)

Total: 5255 samples

Validation set distribution:

  • Label 0: 187 (32.0%)
  • Label 2: 152 (26.0%)
  • Label 1: 131 (22.4%)
  • Label 3: 64 (10.9%)
  • Label 4: 31 (5.3%)
  • Label 5: 20 (3.4%)

Total: 585 samples

Test set distribution:

  • Label 0: 188 (31.8%)
  • Label 2: 153 (25.9%)
  • Label 1: 132 (22.3%)
  • Label 3: 65 (11.0%)
  • Label 4: 32 (5.4%)
  • Label 5: 21 (3.6%)

Total: 591 samples

Model input type: Amino acid sequence

Performance (on test set): 0.68 accuracy
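To put the reported accuracy in context, it helps to compare it with a trivial baseline that always predicts the most frequent test label. The sketch below derives that baseline from the test set counts listed above; the variable names are illustrative, and the card itself reports no baseline or per-class metrics.

```python
# Test set counts per label, copied from the test set distribution in this card.
TEST_COUNTS = {0: 188, 1: 132, 2: 153, 3: 65, 4: 32, 5: 21}

total = sum(TEST_COUNTS.values())

# A majority-class predictor always outputs the most frequent label.
majority_label, majority_count = max(TEST_COUNTS.items(), key=lambda kv: kv[1])
baseline_accuracy = majority_count / total

print(total, majority_label, round(baseline_accuracy, 3))  # 591 0 0.318
```

Because the test set is imbalanced, overall accuracy alone does not show how well the rarer classes (Labels 3 to 5) are predicted.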

LoRA config:

  • r: 8
  • lora_dropout: 0.1
  • lora_alpha: 16
  • target_modules: ["key", "value", "output.dense", "intermediate.dense", "query"]
  • modules_to_save: ["classifier"]
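The LoRA settings above map directly onto the keyword arguments of `peft.LoraConfig`. The sketch below is one way to express them, assuming the `peft` library is the one used (the card does not name it); the import is guarded so the snippet still runs without `peft` installed. It also computes the standard LoRA scaling factor `lora_alpha / r`.

```python
# Values copied from the LoRA config in this card.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": ["key", "value", "output.dense", "intermediate.dense", "query"],
    "modules_to_save": ["classifier"],
}

try:
    # Assumption: the adapter was configured with Hugging Face PEFT.
    from peft import LoraConfig
    lora_config = LoraConfig(**LORA_KWARGS)
except ImportError:
    lora_config = None  # peft is not installed; the kwargs still document the setup

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(scaling)  # 2.0
```

Listing `classifier` under `modules_to_save` means the classification head is trained and stored in full alongside the low-rank adapter weights.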

Training config:

  • optimizer class: AdamW
  • betas: (0.9, 0.98)
  • weight_decay: 0.01
  • learning rate: 0.0005
  • epoch: 25
  • batch size: 64
  • precision: 16-mixed
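As a rough sanity check, the training configuration and the training set size imply the number of optimizer steps. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch is kept; none of these details are stated in the card.

```python
import math

TRAIN_SAMPLES = 5255  # total from the training set distribution
BATCH_SIZE = 64       # from the training config
EPOCHS = 25           # from the training config

# Assumption: the last, smaller batch is not dropped, hence the ceiling.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(steps_per_epoch, total_steps)  # 83 2075
```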

