rvo committed · Commit 6614f0d (verified) · 1 Parent(s): c41b585

Upload README.md

Files changed (1): README.md (+217, -217)
---
license: apache-2.0
base_model: microsoft/MiniLM-L6-v2
tags:
- transformers
- sentence-transformers
- sentence-similarity
- feature-extraction
- text-embeddings-inference
- information-retrieval
- knowledge-distillation
language:
- en
---
<div style="display: flex; justify-content: center;">
  <div style="display: flex; align-items: center; gap: 10px;">
    <img src="logo.webp" alt="MongoDB Logo" style="height: 36px; width: auto; border-radius: 4px;">
    <span style="font-size: 32px; font-weight: bold">MongoDB/mdbr-leaf-mt</span>
  </div>
</div>

# Content

1. [Introduction](#introduction)
2. [Technical Report](#technical-report)
3. [Highlights](#highlights)
4. [Benchmarks](#benchmark-comparison)
5. [Quickstart](#quickstart)
6. [Citation](#citation)

# Introduction

`mdbr-leaf-mt` is a compact, high-performance text embedding model designed for classification, clustering, semantic sentence similarity, and summarization tasks.

To enable even greater efficiency, `mdbr-leaf-mt` supports [flexible asymmetric architectures](#asymmetric-retrieval-setup) and is robust to [vector quantization](#vector-quantization) and [MRL truncation](#mrl-truncation).

If you are looking to perform semantic search / information retrieval (e.g., for RAG), please check out our [`mdbr-leaf-ir`](https://huggingface.co/MongoDB/mdbr-leaf-ir) model, which is specifically trained for these tasks.

> [!NOTE]
> This model has been developed by the ML team of MongoDB Research. At the time of writing it is not used in any of MongoDB's commercial product or service offerings.

# Technical Report

A technical report detailing our proposed `LEAF` training procedure is [available here](https://arxiv.org/abs/2509.12539).

# Highlights

* **State-of-the-Art Performance**: `mdbr-leaf-mt` achieves new state-of-the-art results for compact embedding models, **ranking #1** on the [public MTEB v2 (Eng) benchmark leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for models with ≤30M parameters.
* **Flexible Architecture Support**: `mdbr-leaf-mt` supports asymmetric retrieval architectures, enabling even better retrieval quality. [See below](#asymmetric-retrieval-setup) for more information.
* **MRL and Quantization Support**: embedding vectors generated by `mdbr-leaf-mt` compress well when truncated (MRL) and can be stored using more efficient types like `int8` and `binary`. [See below](#mrl-truncation) for more information.

## Benchmark Comparison

The table below shows the scores for `mdbr-leaf-mt` on the MTEB v2 (English) benchmark, compared to other embedding models.

`mdbr-leaf-mt` ranks #1 on this benchmark for models with ≤30M parameters.

| Model                              | Size    | MTEB v2 (Eng) |
|------------------------------------|---------|---------------|
| OpenAI text-embedding-3-large      | Unknown | 66.43         |
| OpenAI text-embedding-3-small      | Unknown | 64.56         |
| **mdbr-leaf-mt**                   | 23M     | **63.97**     |
| gte-small                          | 33M     | 63.22         |
| snowflake-arctic-embed-s           | 32M     | 61.59         |
| e5-small-v2                        | 33M     | 61.32         |
| granite-embedding-small-english-r2 | 47M     | 61.07         |
| all-MiniLM-L6-v2                   | 22M     | 59.03         |

# Quickstart

## Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("MongoDB/mdbr-leaf-mt")

# Example queries and documents
queries = [
    "What is machine learning?",
    "How does neural network training work?"
]

documents = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
    "Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors."
]

# Encode queries and documents
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute similarity scores
scores = model.similarity(query_embeddings, document_embeddings)

# Print results
for i, query in enumerate(queries):
    print(f"Query: {query}")
    for j, doc in enumerate(documents):
        print(f"  Similarity: {scores[i, j]:.4f} | Document {j}: {doc[:80]}...")

# Query: What is machine learning?
#   Similarity: 0.9063 | Document 0: Machine learning is a subset of ...
#   Similarity: 0.7287 | Document 1: Neural networks are trained ...
#
# Query: How does neural network training work?
#   Similarity: 0.6725 | Document 0: Machine learning is a subset of ...
#   Similarity: 0.8287 | Document 1: Neural networks are trained ...
```

## Transformers Usage

See [here](https://huggingface.co/MongoDB/mdbr-leaf-mt/blob/main/transformers_example_mt.ipynb).

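The snippet below is a minimal raw-`transformers` sketch of the usual sentence-embedding pattern. It assumes mean pooling over token embeddings and no query prompt prefix, which may not match this model's exact configuration; treat the linked notebook and the model's bundled Sentence Transformers config as authoritative.

```python
# Minimal sketch, assuming mean pooling and no query prompt (see the notebook for the exact setup)
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MongoDB/mdbr-leaf-mt")
encoder = AutoModel.from_pretrained("MongoDB/mdbr-leaf-mt")

sentences = ["What is machine learning?"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = encoder(**inputs).last_hidden_state  # (batch, seq_len, dim)

# Mean pooling: average token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, dim=-1)

print(embeddings.shape)
```
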
## Asymmetric Retrieval Setup

`mdbr-leaf-mt` is *aligned* to [`mxbai-embed-large-v1`](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1), the model it has been distilled from, making the asymmetric setup below possible:

```python
# Use mdbr-leaf-mt for query encoding (real-time, low latency)
query_model = SentenceTransformer("MongoDB/mdbr-leaf-mt")
query_embeddings = query_model.encode(queries, prompt_name="query")

# Use a larger model for document encoding (one-time, at index time)
doc_model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
document_embeddings = doc_model.encode(documents)

# Compute similarities
scores = query_model.similarity(query_embeddings, document_embeddings)
```

Retrieval results from this asymmetric mode are usually superior to those from the [standard mode above](#sentence-transformers).

## MRL Truncation

Embeddings have been trained via [MRL](https://arxiv.org/abs/2205.13147) and can be truncated for more efficient storage:

```python
from torch.nn import functional as F

query_embeds = model.encode(queries, prompt_name="query", convert_to_tensor=True)
doc_embeds = model.encode(documents, convert_to_tensor=True)

# Truncate and normalize according to MRL
query_embeds = F.normalize(query_embeds[:, :256], dim=-1)
doc_embeds = F.normalize(doc_embeds[:, :256], dim=-1)

similarities = model.similarity(query_embeds, doc_embeds)

print('After MRL:')
print(f"* Embeddings dimension: {query_embeds.shape[1]}")
print(f"* Similarities:\n\t{similarities}")

# After MRL:
# * Embeddings dimension: 256
# * Similarities:
# tensor([[0.9164, 0.7219],
#         [0.6682, 0.8393]], device='cuda:0')
```

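Alternatively, recent Sentence Transformers releases can apply MRL truncation at load time via the `truncate_dim` argument. A minimal sketch, assuming your installed version supports this option:

```python
from sentence_transformers import SentenceTransformer

# Every embedding produced by this instance is truncated to 256 dimensions
model_256 = SentenceTransformer("MongoDB/mdbr-leaf-mt", truncate_dim=256)

# normalize_embeddings=True re-normalizes the truncated vectors, as MRL requires
query_embeds = model_256.encode(queries, prompt_name="query", normalize_embeddings=True)
doc_embeds = model_256.encode(documents, normalize_embeddings=True)

print(query_embeds.shape[1])  # 256
print(model_256.similarity(query_embeds, doc_embeds))
```
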
## Vector Quantization

Vector quantization, for example to `int8` or `binary`, can be performed as follows:

**Note**: for quantization to types other than `binary`, we suggest performing a calibration to determine the optimal ranges, [see here](https://sbert.net/examples/sentence_transformer/applications/embedding-quantization/README.html#scalar-int8-quantization). Good initial values are -1.0 and +1.0.

```python
from sentence_transformers.quantization import quantize_embeddings
import torch

query_embeds = model.encode(queries, prompt_name="query")
doc_embeds = model.encode(documents)

# Quantize embeddings to int8 using -1.0 and +1.0 as calibration ranges
ranges = torch.tensor([[-1.0], [+1.0]]).expand(2, query_embeds.shape[1]).cpu().numpy()
query_embeds = quantize_embeddings(query_embeds, "int8", ranges=ranges)
doc_embeds = quantize_embeddings(doc_embeds, "int8", ranges=ranges)

# Calculate similarities; cast to int64 to avoid under/overflow
similarities = query_embeds.astype(int) @ doc_embeds.astype(int).T

print('After quantization:')
print(f"* Embeddings type: {query_embeds.dtype}")
print(f"* Similarities:\n{similarities}")

# After quantization:
# * Embeddings type: int8
# * Similarities:
# [[2202032 1422868]
#  [1421197 1845580]]
```

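For binary quantization no calibration ranges are needed. The sketch below is one possible approach rather than the method used to evaluate the model: it uses the `ubinary` precision to pack the thresholded embedding bits into unsigned bytes and ranks documents by Hamming distance (lower is more similar).

```python
# Sketch: binary quantization and Hamming-distance scoring (one common option, not the only one)
import numpy as np
from sentence_transformers.quantization import quantize_embeddings

query_embeds = model.encode(queries, prompt_name="query")
doc_embeds = model.encode(documents)

# Pack each embedding into bits: shape becomes (n, dim / 8), dtype uint8
binary_queries = quantize_embeddings(query_embeds, "ubinary")
binary_docs = quantize_embeddings(doc_embeds, "ubinary")

# Unpack back to individual bits and count mismatches per query/document pair
q_bits = np.unpackbits(binary_queries, axis=-1)
d_bits = np.unpackbits(binary_docs, axis=-1)
hamming = (q_bits[:, None, :] != d_bits[None, :, :]).sum(axis=-1)

print(hamming)  # lower Hamming distance = more similar
```
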
# Evaluation

Please [see here](https://huggingface.co/MongoDB/mdbr-leaf-mt/blob/main/evaluate_models.ipynb).

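If you want to run the MTEB benchmark yourself, a rough sketch with the `mteb` package is below. The benchmark name string and the `MTEB`/`run` API differ across `mteb` versions, so the notebook linked above remains the reference.

```python
# Sketch: running MTEB v2 (Eng) with the mteb package; API details depend on your mteb version
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MongoDB/mdbr-leaf-mt")

# "MTEB(eng, v2)" is the English v2 benchmark name in recent mteb releases
benchmark = mteb.get_benchmark("MTEB(eng, v2)")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results/mdbr-leaf-mt")
```
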
# Citation

If you use this model in your work, please cite:

```bibtex
@misc{mdbr_leaf,
      title={LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations},
      author={Robin Vujanic and Thomas Rueckstiess},
      year={2025},
      eprint={2509.12539},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2509.12539},
}
```

# License

This model is released under the Apache 2.0 license.

# Contact

For questions or issues, please open an issue or pull request. You can also contact the MongoDB ML Research team at [email protected].