mlsa-iai-msu-lab/sci-rus-tiny3.5-zh
Model Description
This is a multilingual encoder model for scientific text embeddings, supporting Russian, English, and Chinese. It was trained on a large corpus of scientific papers, together with their citation and co-citation links, to capture semantic similarity within and across these languages.
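A minimal usage sketch follows, assuming the checkpoint is packaged for sentence-transformers; the pooling and normalization behavior come from the uploaded model configuration and are not specified in this card:

```python
# Minimal usage sketch, assuming a sentence-transformers-compatible checkpoint;
# pooling/normalization come from the model config and are not stated in this card.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mlsa-iai-msu-lab/sci-rus-tiny3.5-zh")

sentences = [
    "Методы глубокого обучения для анализа научных текстов.",       # Russian
    "Deep learning methods for the analysis of scientific texts.",  # English
    "用于科学文本分析的深度学习方法。",                              # Chinese
]
embeddings = model.encode(sentences)

# If the multilingual vector space is well aligned, these three
# translations should all map to nearby vectors.
print(util.cos_sim(embeddings, embeddings))
```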
Training Stages
The model was trained from scratch in two stages: Masked Language Modeling (MLM) and Contrastive Learning. A custom tokenizer was also trained from scratch to better handle scientific terminology across the three languages.
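The card does not say which tokenization algorithm was used. As a rough illustration only, here is a sketch of training a byte-level BPE tokenizer from scratch with the Hugging Face tokenizers library; the vocabulary size, special tokens, and file names are all assumptions:

```python
# Hypothetical tokenizer-training sketch; the algorithm (byte-level BPE),
# vocab size, and input files are assumptions, not details from this card.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Byte-level pre-tokenization also handles Chinese, which has no whitespace.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=64_000,  # assumed value
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# One abstract per line, per language (hypothetical file names).
tokenizer.train(["abstracts_ru.txt", "abstracts_en.txt", "abstracts_zh.txt"], trainer)
tokenizer.save("sci-tokenizer.json")
```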
1. Masked Language Modeling (MLM)
In the first stage, the model was trained from scratch using the MLM objective on 31 million scientific abstracts collected from Elibrary, Semantic Scholar (S2), and ScienceChina.
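As a sketch of what this stage could look like with the Hugging Face transformers Trainer; the architecture size, 15% masking rate, and all hyperparameters below are assumptions rather than values from this card:

```python
# Illustrative MLM pre-training setup; model size, masking probability,
# and training hyperparameters are assumptions, not details from this card.
from datasets import Dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load the custom tokenizer trained in the previous step (hypothetical file).
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="sci-tokenizer.json",
    unk_token="[UNK]", cls_token="[CLS]", sep_token="[SEP]",
    pad_token="[PAD]", mask_token="[MASK]",
)

# A "tiny"-scale encoder; the exact architecture is an assumption.
config = BertConfig(vocab_size=tokenizer.vocab_size,
                    hidden_size=312, num_hidden_layers=3, num_attention_heads=12)
model = BertForMaskedLM(config)

# Toy stand-in for the 31M-abstract corpus.
texts = ["Deep learning methods for scientific text analysis.",
         "用于科学文本分析的深度学习方法。"]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Standard BERT-style dynamic masking; the 15% rate is an assumption.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-checkpoints", per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```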
2. Contrastive Learning
In the second stage, the model was fine-tuned using contrastive learning to optimize for semantic similarity.
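The card does not name the loss function. A common choice for this kind of pair data is an in-batch-negatives (InfoNCE-style) objective; the sketch below uses sentence-transformers' MultipleNegativesRankingLoss as a stand-in, with hypothetical paths and toy pairs:

```python
# Contrastive fine-tuning sketch with in-batch negatives; the actual loss,
# data pipeline, and hyperparameters for this model are not stated in the card.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("mlm-checkpoints")  # hypothetical path to the MLM-stage checkpoint

# Positive pairs: a paper's title with its abstract, or texts of two linked papers.
train_examples = [
    InputExample(texts=["Paper title ...", "Abstract of the same paper ..."]),
    InputExample(texts=["Text of citing paper ...", "Text of cited paper ..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Every other pair in the batch serves as a negative for a given anchor.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```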
Training Data & Sampling (Contrastive Stage)
The contrastive stage uses data from three primary sources:
- Elibrary (Ru-En)
- Semantic Scholar (S2) (En)
- ScienceChina (Zh-En)
Two types of datasets were used during training:
Title-Abstract Pairs:
- Each pair consists of the title and the abstract of the same paper.
- Cross-lingual sampling: cross-lingual pairs (e.g., a Russian title with an English abstract) are upsampled 8x to improve alignment of the multilingual vector space.
Citation and Co-citation Pairs:
- Each pair consists of two related papers (e.g., Article A cites Article B, or A and B are co-cited).
- Text selection: for each article, either the title or the abstract is selected at random.
- Cross-lingual sampling: with 50% probability, the two texts are taken in different languages (when available) to reinforce multilinguality. Both sampling schemes are sketched below.
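A minimal sketch of the sampling logic described above; the paper data structure and helper names are hypothetical, and only the stated rules (8x cross-lingual upsampling, random title/abstract choice, 50% cross-lingual selection) come from the card:

```python
import random

CROSS_LINGUAL_UPSAMPLE = 8  # stated 8x upsampling of cross-lingual pairs

def title_abstract_pairs(paper):
    """Yield title-abstract pairs for one paper.

    `paper` is a hypothetical dict: language -> {"title": str, "abstract": str}.
    """
    for lang_t, rec_t in paper.items():
        for lang_a, rec_a in paper.items():
            repeats = 1 if lang_t == lang_a else CROSS_LINGUAL_UPSAMPLE
            for _ in range(repeats):
                yield rec_t["title"], rec_a["abstract"]

def citation_pair(paper_a, paper_b):
    """Sample one text from each of two linked (citing or co-cited) papers."""
    langs_a, langs_b = list(paper_a), list(paper_b)
    if random.random() < 0.5 and len(set(langs_a) | set(langs_b)) > 1:
        # 50% of the time, deliberately pick different languages when available.
        lang_a = random.choice(langs_a)
        lang_b = random.choice([l for l in langs_b if l != lang_a] or langs_b)
    else:
        lang_a = random.choice(langs_a)
        lang_b = random.choice(langs_b)
    # For each article, use either the title or the abstract at random.
    text_a = paper_a[lang_a][random.choice(["title", "abstract"])]
    text_b = paper_b[lang_b][random.choice(["title", "abstract"])]
    return text_a, text_b
```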
Dataset Statistics
| Language Pair | Dataset Type | Examples | Tokens |
|---|---|---|---|
| ru/en | title-abstract | 17,727,817 | 4.53B |
| ru/en | co-citation | 33,682,590 | 18.95B |
| ru/en | citation | 39,988,291 | 22.91B |
| zh/en | title-abstract | 4,643,720 | 2.17B |
| zh/en | citation | 9,181,506 | 8.89B |
| en | title-abstract | 30,561,536 | 9.05B |
| en | citation | 13,307,255 | 7.35B |
| en | co-citation | 61,950,491 | 34.20B |
| Total | | 211,043,206 | 108.04B |