Base model: westlake-repl/SaProt_35M_AF2
Task type: protein-level classification
Dataset: This model classifies proteins into 6 major Enzyme Commission (EC) classes (EC1-EC6). EC7 was excluded because only 31 samples were available. To address class imbalance, Label 4 (EC5) was duplicated 2 times and Label 5 (EC6) was duplicated 1 time in the training set. Training data were obtained from: https://academic.oup.com/nar/article/54/D1/D643/8313833
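The duplication step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the helper name `oversample`, the toy sequences, and the reading of "duplicated N times" as "N extra copies per sample" are all assumptions.

```python
from collections import Counter

def oversample(samples, extra_copies):
    """Return a new list where each (sequence, label) pair whose label is a
    key of extra_copies is repeated that many additional times."""
    out = []
    for seq, label in samples:
        out.extend([(seq, label)] * (1 + extra_copies.get(label, 0)))
    return out

# Toy data: placeholder sequences, not real entries from the dataset.
toy = [("MKT", 0), ("GAV", 4), ("LLS", 5)]

# Assumption: "duplicated N times" means N extra copies of each sample.
balanced = oversample(toy, {4: 2, 5: 1})
counts = Counter(label for _, label in balanced)
print(counts)
```

Only the training split should be passed through such a step; the validation and test splits are left at their natural class frequencies.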
Label mapping:
- Label 0: Oxidoreductase (EC1)
- Label 1: Transferase (EC2)
- Label 2: Hydrolase (EC3)
- Label 3: Lyase (EC4)
- Label 4: Isomerase (EC5)
- Label 5: Ligase (EC6)
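For downstream use, the mapping above can be kept as a small lookup table. The sketch below is illustrative only; the names `EC_LABELS` and `decode_prediction`, and the example score vector, are hypothetical and not part of the model's own code.

```python
# Label index -> (enzyme class name, EC top-level number), as listed above.
EC_LABELS = {
    0: ("Oxidoreductase", "EC1"),
    1: ("Transferase", "EC2"),
    2: ("Hydrolase", "EC3"),
    3: ("Lyase", "EC4"),
    4: ("Isomerase", "EC5"),
    5: ("Ligase", "EC6"),
}

def decode_prediction(scores):
    """Map a list of 6 per-class scores to the name and EC number
    of the highest-scoring class."""
    if len(scores) != len(EC_LABELS):
        raise ValueError("expected one score per class")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return EC_LABELS[best]

# Hypothetical score vector, for illustration only.
name, ec = decode_prediction([0.1, 0.2, 2.3, -0.4, 0.0, 0.5])
print(name, ec)  # Hydrolase EC3
```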
Training set distribution:
- Label 0: 1497 (28.5%)
- Label 2: 1217 (23.2%)
- Label 1: 1050 (20.0%)
- Label 3: 512 (9.7%)
- Label 4: 496 (9.4%)
- Label 5: 483 (9.2%)
Total: 5255 samples
Validation set distribution:
- Label 0: 187 (32.0%)
- Label 2: 152 (26.0%)
- Label 1: 131 (22.4%)
- Label 3: 64 (10.9%)
- Label 4: 31 (5.3%)
- Label 5: 20 (3.4%)
Total: 585 samples
Test set distribution:
- Label 0: 188 (31.8%)
- Label 2: 153 (25.9%)
- Label 1: 132 (22.3%)
- Label 3: 65 (11.0%)
- Label 4: 32 (5.4%)
- Label 5: 21 (3.6%)
Total: 591 samples
Model input type: Amino acid sequence
Performance (on test set): 0.68 accuracy
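To put the reported accuracy in context, it can be compared with a majority-class baseline computed from the test set counts listed above. This is a back-of-the-envelope sketch; the baseline is not something the model card itself reports.

```python
# Test set class counts, copied from the distribution above.
test_counts = {0: 188, 1: 132, 2: 153, 3: 65, 4: 32, 5: 21}

total = sum(test_counts.values())
majority_label = max(test_counts, key=test_counts.get)

# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy = test_counts[majority_label] / total

reported_accuracy = 0.68  # value stated in the model card
print(f"majority label {majority_label}: {baseline_accuracy:.3f}")  # majority label 0: 0.318
print(f"gain over baseline: {reported_accuracy - baseline_accuracy:.3f}")
```

Because the test set is imbalanced, plain accuracy mostly reflects the larger classes; per-class metrics would be needed to judge performance on Labels 4 and 5.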
LoRA config:
- r: 8
- lora_dropout: 0.1
- lora_alpha: 16
- target_modules: ["key", "value", "output.dense", "intermediate.dense", "query"]
- modules_to_save: ["classifier"]
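The values above correspond to arguments of `LoraConfig` in the Hugging Face `peft` library. The sketch below assumes `peft` is what was used (the model card does not name the library) and imports it lazily so the plain dictionary remains usable without it; `LORA_KWARGS` and `build_lora_config` are hypothetical names.

```python
# LoRA hyperparameters as listed in the model card.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": ["key", "value", "output.dense", "intermediate.dense", "query"],
    "modules_to_save": ["classifier"],
}

def build_lora_config():
    """Build a peft LoraConfig from the values above (requires peft)."""
    from peft import LoraConfig  # assumption: peft is installed
    return LoraConfig(**LORA_KWARGS)

# In standard LoRA the low-rank update is scaled by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(scaling)  # 2.0
```

`modules_to_save` keeps the classification head fully trainable alongside the low-rank adapters.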
Training config:
- optimizer:
  - class: AdamW
  - betas: (0.9, 0.98)
  - weight_decay: 0.01
- learning rate: 0.0005
- epoch: 25
- batch size: 64
- precision: 16-mixed
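As a rough illustration of what these settings imply, the sketch below estimates the number of optimizer steps and shows how the optimizer values map onto `torch.optim.AdamW`. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; none of these details are stated in the model card. The names `ADAMW_KWARGS` and `build_optimizer` are hypothetical.

```python
import math

TRAIN_SAMPLES = 5255  # training set total from the distribution above
BATCH_SIZE = 64
EPOCHS = 25

# Assumption: the final, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
print(steps_per_epoch, total_steps)  # 83 2075

# Optimizer hyperparameters as listed in the model card.
ADAMW_KWARGS = {"lr": 5e-4, "betas": (0.9, 0.98), "weight_decay": 0.01}

def build_optimizer(params):
    """Create an AdamW optimizer over the given parameters (requires torch)."""
    import torch  # imported lazily; assumption: PyTorch is installed
    return torch.optim.AdamW(params, **ADAMW_KWARGS)
```

The `16-mixed` precision value matches the string PyTorch Lightning uses for fp16 mixed-precision training, although the model card does not say which training framework was used.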
Model: SaProtHub/EC-classification-35M