mlsa-iai-msu-lab/sci-rus-tiny3.5-zh

Model Description

This is a multilingual encoder model for scientific text embeddings, supporting Russian, English, and Chinese. It is a compact encoder of about 20.1M parameters (F32 weights, distributed in Safetensors format), trained on a large corpus of scientific papers, citations, and co-citations to capture semantic similarity across these languages.
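
For quick experimentation, embeddings can be computed with the transformers library. The sketch below assumes mean pooling over token embeddings followed by L2 normalization; this is a common choice for sentence encoders, not a documented property of this model, so verify the recommended pooling before relying on it.

```python
# Minimal embedding sketch. Mean pooling and L2 normalization are
# assumptions, not documented properties of this model.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "mlsa-iai-msu-lab/sci-rus-tiny3.5-zh"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(texts):
    # Tokenize a batch of titles/abstracts with padding and truncation.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # Mean-pool token embeddings, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    # Normalize so that dot products equal cosine similarities.
    return torch.nn.functional.normalize(emb, dim=-1)

vecs = embed([
    "Deep learning for protein structure prediction",
    "Глубокое обучение для предсказания структуры белков",
])
print((vecs[0] @ vecs[1]).item())  # cross-lingual cosine similarity
```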

Training Stages

The model was trained from scratch in two stages: Masked Language Modeling (MLM) and Contrastive Learning. A custom tokenizer was also trained from scratch to better handle scientific terminology across the three languages.
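
As an illustration of the tokenizer step, the sketch below trains a WordPiece tokenizer from scratch with the Hugging Face tokenizers library. The algorithm choice, vocabulary size, normalization, and corpus file are all assumptions; the card does not state how the actual tokenizer was built.

```python
# Hypothetical tokenizer-training sketch; algorithm, vocab size, and file
# names are assumptions, not the authors' documented setup.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.Lowercase()]
)
# Whitespace pre-tokenization is a simplification; Chinese text may call for
# a different pre-tokenizer or character-level handling.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,  # assumed; the real vocabulary size is not stated
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
# abstracts.txt: one ru/en/zh scientific abstract per line (hypothetical file).
tokenizer.train(["abstracts.txt"], trainer)
tokenizer.save("sci-tokenizer.json")
```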

1. Masked Language Modeling (MLM)

In the first stage, the model was trained from scratch using the MLM objective on 31 million scientific abstracts collected from Elibrary, Semantic Scholar (S2), and ScienceChina.
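
The sketch below shows what such an MLM stage can look like with the transformers Trainer. The tiny BERT configuration, 15% masking rate, and hyperparameters are illustrative assumptions, not the authors' published settings.

```python
# Hedged MLM pre-training sketch; configuration and hyperparameters are
# illustrative assumptions.
from datasets import load_dataset
from transformers import (
    BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast, Trainer, TrainingArguments,
)

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="sci-tokenizer.json",  # custom tokenizer from the step above
    pad_token="[PAD]", unk_token="[UNK]", cls_token="[CLS]",
    sep_token="[SEP]", mask_token="[MASK]",
)

# A tiny BERT encoder; the exact architecture is an assumption here.
config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=312,
                    num_hidden_layers=3, num_attention_heads=12)
model = BertForMaskedLM(config)

dataset = load_dataset("text", data_files={"train": "abstracts.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-ckpt", per_device_train_batch_size=64),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("mlm-ckpt")
tokenizer.save_pretrained("mlm-ckpt")
```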

2. Contrastive Learning

In the second stage, the model was fine-tuned using contrastive learning to optimize for semantic similarity.
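
One standard way to implement this stage is in-batch-negatives training with sentence-transformers, sketched below. The loss function, batch size, and schedule are assumptions; the card does not specify which contrastive objective was used.

```python
# Contrastive fine-tuning sketch using in-batch negatives; the actual loss
# and hyperparameters for this model are assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("mlm-ckpt")  # MLM-pretrained encoder from stage 1

# Positive pairs: a title and abstract of the same paper, or texts of two
# papers linked by a (co-)citation, sampled as described below.
train_examples = [
    InputExample(texts=[
        "Graph neural networks for molecular property prediction",
        "We study message-passing architectures on molecular graphs ...",
    ]),
    # ... millions of pairs in practice
]
loader = DataLoader(train_examples, shuffle=True, batch_size=256)

# Every other example in the batch acts as a negative for a given pair.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=1000)
```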

Training Data & Sampling (Contrastive Stage)

The model was trained on data from three primary sources:

  • Elibrary (Ru-En)
  • Semantic Scholar (S2) (En)
  • ScienceChina (Zh-En)

Two types of datasets were used during training:

  1. Title-Abstract Pairs:
    • Pairs consist of the title and abstract of the same paper.
    • Cross-lingual sampling: cross-lingual pairs (e.g., a Russian title with an English abstract) are upsampled 8x to improve alignment of the multilingual vector space.
  2. Citation and Co-citation Pairs:
    • Pairs consist of two related papers (e.g., Article A cites Article B).
    • Text selection: for each article, either the title or the abstract is randomly selected.
    • Cross-lingual sampling: with 50% probability, texts in different languages are selected (if available) to reinforce multilinguality. A sketch of this sampling logic follows the list.
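
A hypothetical sketch of this sampling logic is shown below. The 8x upsampling factor and the 50% cross-lingual probability come from the description above; the record layout and helper names are illustrative assumptions.

```python
# Pair-sampling sketch; record fields ("titles", "abstracts") and helpers are
# hypothetical. The 8x and 50% figures come from the description above.
import random

def title_abstract_pairs(papers):
    """Yield (title, abstract) pairs, upsampling cross-lingual pairs 8x."""
    for paper in papers:
        for title_lang, title in paper["titles"].items():
            for abs_lang, abstract in paper["abstracts"].items():
                # e.g., a Russian title paired with an English abstract
                repeats = 8 if title_lang != abs_lang else 1
                for _ in range(repeats):
                    yield title, abstract

def citation_pairs(edges, papers):
    """Yield text pairs for (co-)citation edges with 50% cross-lingual picks."""
    for a_id, b_id in edges:
        a, b = papers[a_id], papers[b_id]
        # For each article, randomly select either the title or the abstract.
        field_a = random.choice(["titles", "abstracts"])
        field_b = random.choice(["titles", "abstracts"])
        langs_a, langs_b = list(a[field_a]), list(b[field_b])
        cross = [(la, lb) for la in langs_a for lb in langs_b if la != lb]
        if cross and random.random() < 0.5:
            # With 50% probability, pick texts in different languages.
            lang_a, lang_b = random.choice(cross)
        else:
            lang_a, lang_b = random.choice(langs_a), random.choice(langs_b)
        yield a[field_a][lang_a], b[field_b][lang_b]
```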

Dataset Statistics

Language Pair   Dataset Type      Examples       Tokens
ru/en           title-abstract    17,727,817     4.53B
ru/en           co-citation       33,682,590     18.95B
ru/en           citation          39,988,291     22.91B
zh/en           title-abstract    4,643,720      2.17B
zh/en           citation          9,181,506      8.89B
en              title-abstract    30,561,536     9.05B
en              citation          13,307,255     7.35B
en              co-citation       61,950,491     34.20B
Total                             211,043,206    108.04B