CatSchroedinger committed
Commit eab3029 · verified · 1 Parent(s): 6851c01

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,739 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:6300
+ - loss:MatryoshkaLoss
+ - loss:MultipleNegativesRankingLoss
+ base_model: nomic-ai/nomic-embed-text-v1.5
+ widget:
+ - source_sentence: What contributed to the increase in accounts receivable in 2023
+ compared to the previous year?
+ sentences:
+ - At December 31, 2023, Caterpillar’s consolidated net worth was $19.55 billion,
+ which was above the $9.00 billion required under the Credit Facility.
+ - Accounts receivable increased to $3,154 million at October 29, 2023 from $2,958
+ million at October 30, 2022, primarily due to revenue linearity, offset in part
+ by additional receivables sold through factoring arrangements.
+ - The EU adopted the Carbon Border Adjustment Mechanism (CBAM), which subjects certain
+ imported materials such as iron, steel, and aluminum, to a carbon levy linked
+ to the carbon price payable on domestic goods under the European Trading Scheme.
+ The CBAM could increase costs of importing such materials and/or limit the ability
+ to import lower cost materials from non-EU countries.
+ - source_sentence: What primarily constituted marketing expenses for the year ended
+ December 31, 2025?
+ sentences:
+ - This is the world’s first liquefied natural gas (LNG) plant supplied with associated
+ gas, where the natural gas is a byproduct of crude oil production.
+ - Marketing expenses consist primarily of advertising expenses and certain payments
+ made to our marketing and advertising sales partners.
+ - In January 2024, the CFPB proposed a rule that could significantly restrict bank
+ overdraft fees.
+ - source_sentence: What is the total cash flow from operating activities for Airbnb,
+ Inc. in 2023?
+ sentences:
+ - Net cash provided by operating activities | 2,313 | | 3,430 | | 3,884
+ - Under ASO contracts, self-funded employers generally retain the risk of financing
+ the costs of health benefits, with large group customers retaining a greater share
+ and small group customers a smaller share of the cost of health benefits.
+ - Of the $2.2 billion in revenue that we generated in 2023, 55% came from customers
+ in the government segment, and 45% came from customers in the commercial segment.
+ - source_sentence: What major legislative act mentioned in the text was enacted by
+ the U.S. government on August 16, 2022?
+ sentences:
+ - In July 2022, the borrowing capacity under the back-up facilities expanded from
+ $3.0 billion to $5.0 billion.
+ - The Company manages its investment portfolio to limit its exposure to any one
+ issuer or market sector, and largely limits its investments to investment
+ grade quality.
+ - On August 16, 2022, the U.S. government enacted the Inflation Reduction Act of
+ 2022.
+ - source_sentence: What is the maximum duration for patent term restoration for pharmaceutical
+ products in the U.S.?
+ sentences:
+ - Patent term restoration for a single patent for a pharmaceutical product is provided
+ to U.S. patent holders to compensate for a portion of the time invested in clinical
+ trials and the U.S. Food and Drug Administration (FDA). There is a five-year cap
+ on any restoration, and no patent's expiration date may be extended beyond 14
+ years from FDA approval.
+ - Using AI technologies, our Tax Advisor offering leverages information generated
+ from our ProConnect Tax Online and Lacerte offerings to enable year-round tax
+ planning services and communicate tax savings strategies to clients.
+ - In 2023, catastrophe losses were primarily due to U.S. flooding, hail, tornadoes,
+ and wind events.
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ model-index:
+ - name: nomic 1.5 base Financial Matryoshka
+ results:
+ - task:
+ type: information-retrieval
+ name: Information Retrieval
+ dataset:
+ name: dim 768
+ type: dim_768
+ metrics:
+ - type: cosine_accuracy@1
+ value: 0.7285714285714285
+ name: Cosine Accuracy@1
+ - type: cosine_accuracy@3
+ value: 0.8514285714285714
+ name: Cosine Accuracy@3
+ - type: cosine_accuracy@5
+ value: 0.8885714285714286
+ name: Cosine Accuracy@5
+ - type: cosine_accuracy@10
+ value: 0.92
+ name: Cosine Accuracy@10
+ - type: cosine_precision@1
+ value: 0.7285714285714285
+ name: Cosine Precision@1
+ - type: cosine_precision@3
+ value: 0.28380952380952384
+ name: Cosine Precision@3
+ - type: cosine_precision@5
+ value: 0.17771428571428569
+ name: Cosine Precision@5
+ - type: cosine_precision@10
+ value: 0.09199999999999998
+ name: Cosine Precision@10
+ - type: cosine_recall@1
+ value: 0.7285714285714285
+ name: Cosine Recall@1
+ - type: cosine_recall@3
+ value: 0.8514285714285714
+ name: Cosine Recall@3
+ - type: cosine_recall@5
+ value: 0.8885714285714286
+ name: Cosine Recall@5
+ - type: cosine_recall@10
+ value: 0.92
+ name: Cosine Recall@10
+ - type: cosine_ndcg@10
+ value: 0.82688931465871
+ name: Cosine Ndcg@10
+ - type: cosine_mrr@10
+ value: 0.7967777777777774
+ name: Cosine Mrr@10
+ - type: cosine_map@100
+ value: 0.8005981271078951
+ name: Cosine Map@100
+ - task:
+ type: information-retrieval
+ name: Information Retrieval
+ dataset:
+ name: dim 512
+ type: dim_512
+ metrics:
+ - type: cosine_accuracy@1
+ value: 0.7142857142857143
+ name: Cosine Accuracy@1
+ - type: cosine_accuracy@3
+ value: 0.8414285714285714
+ name: Cosine Accuracy@3
+ - type: cosine_accuracy@5
+ value: 0.8857142857142857
+ name: Cosine Accuracy@5
+ - type: cosine_accuracy@10
+ value: 0.9214285714285714
+ name: Cosine Accuracy@10
+ - type: cosine_precision@1
+ value: 0.7142857142857143
+ name: Cosine Precision@1
+ - type: cosine_precision@3
+ value: 0.2804761904761905
+ name: Cosine Precision@3
+ - type: cosine_precision@5
+ value: 0.17714285714285713
+ name: Cosine Precision@5
+ - type: cosine_precision@10
+ value: 0.09214285714285714
+ name: Cosine Precision@10
+ - type: cosine_recall@1
+ value: 0.7142857142857143
+ name: Cosine Recall@1
+ - type: cosine_recall@3
+ value: 0.8414285714285714
+ name: Cosine Recall@3
+ - type: cosine_recall@5
+ value: 0.8857142857142857
+ name: Cosine Recall@5
+ - type: cosine_recall@10
+ value: 0.9214285714285714
+ name: Cosine Recall@10
+ - type: cosine_ndcg@10
+ value: 0.8200375337187854
+ name: Cosine Ndcg@10
+ - type: cosine_mrr@10
+ value: 0.7872664399092969
+ name: Cosine Mrr@10
+ - type: cosine_map@100
+ value: 0.7910342395417198
+ name: Cosine Map@100
+ - task:
+ type: information-retrieval
+ name: Information Retrieval
+ dataset:
+ name: dim 256
+ type: dim_256
+ metrics:
+ - type: cosine_accuracy@1
+ value: 0.7014285714285714
+ name: Cosine Accuracy@1
+ - type: cosine_accuracy@3
+ value: 0.8385714285714285
+ name: Cosine Accuracy@3
+ - type: cosine_accuracy@5
+ value: 0.88
+ name: Cosine Accuracy@5
+ - type: cosine_accuracy@10
+ value: 0.9242857142857143
+ name: Cosine Accuracy@10
+ - type: cosine_precision@1
+ value: 0.7014285714285714
+ name: Cosine Precision@1
+ - type: cosine_precision@3
+ value: 0.2795238095238095
+ name: Cosine Precision@3
+ - type: cosine_precision@5
+ value: 0.176
+ name: Cosine Precision@5
+ - type: cosine_precision@10
+ value: 0.09242857142857142
+ name: Cosine Precision@10
+ - type: cosine_recall@1
+ value: 0.7014285714285714
+ name: Cosine Recall@1
+ - type: cosine_recall@3
+ value: 0.8385714285714285
+ name: Cosine Recall@3
+ - type: cosine_recall@5
+ value: 0.88
+ name: Cosine Recall@5
+ - type: cosine_recall@10
+ value: 0.9242857142857143
+ name: Cosine Recall@10
+ - type: cosine_ndcg@10
+ value: 0.8144449051665447
+ name: Cosine Ndcg@10
+ - type: cosine_mrr@10
+ value: 0.7791428571428568
+ name: Cosine Mrr@10
+ - type: cosine_map@100
+ value: 0.7827133843260672
+ name: Cosine Map@100
+ - task:
+ type: information-retrieval
+ name: Information Retrieval
+ dataset:
+ name: dim 128
+ type: dim_128
+ metrics:
+ - type: cosine_accuracy@1
+ value: 0.7071428571428572
+ name: Cosine Accuracy@1
+ - type: cosine_accuracy@3
+ value: 0.8342857142857143
+ name: Cosine Accuracy@3
+ - type: cosine_accuracy@5
+ value: 0.8742857142857143
+ name: Cosine Accuracy@5
+ - type: cosine_accuracy@10
+ value: 0.9242857142857143
+ name: Cosine Accuracy@10
+ - type: cosine_precision@1
+ value: 0.7071428571428572
+ name: Cosine Precision@1
+ - type: cosine_precision@3
+ value: 0.2780952380952381
+ name: Cosine Precision@3
+ - type: cosine_precision@5
+ value: 0.17485714285714282
+ name: Cosine Precision@5
+ - type: cosine_precision@10
+ value: 0.09242857142857142
+ name: Cosine Precision@10
+ - type: cosine_recall@1
+ value: 0.7071428571428572
+ name: Cosine Recall@1
+ - type: cosine_recall@3
+ value: 0.8342857142857143
+ name: Cosine Recall@3
+ - type: cosine_recall@5
+ value: 0.8742857142857143
+ name: Cosine Recall@5
+ - type: cosine_recall@10
+ value: 0.9242857142857143
+ name: Cosine Recall@10
+ - type: cosine_ndcg@10
+ value: 0.8164938269316206
+ name: Cosine Ndcg@10
+ - type: cosine_mrr@10
+ value: 0.78222052154195
+ name: Cosine Mrr@10
+ - type: cosine_map@100
+ value: 0.7855606408045326
+ name: Cosine Map@100
+ - task:
+ type: information-retrieval
+ name: Information Retrieval
+ dataset:
+ name: dim 64
+ type: dim_64
+ metrics:
+ - type: cosine_accuracy@1
+ value: 0.67
+ name: Cosine Accuracy@1
+ - type: cosine_accuracy@3
+ value: 0.8085714285714286
+ name: Cosine Accuracy@3
+ - type: cosine_accuracy@5
+ value: 0.8514285714285714
+ name: Cosine Accuracy@5
+ - type: cosine_accuracy@10
+ value: 0.8985714285714286
+ name: Cosine Accuracy@10
+ - type: cosine_precision@1
+ value: 0.67
+ name: Cosine Precision@1
+ - type: cosine_precision@3
+ value: 0.26952380952380955
+ name: Cosine Precision@3
+ - type: cosine_precision@5
+ value: 0.17028571428571426
+ name: Cosine Precision@5
+ - type: cosine_precision@10
+ value: 0.08985714285714284
+ name: Cosine Precision@10
+ - type: cosine_recall@1
+ value: 0.67
+ name: Cosine Recall@1
+ - type: cosine_recall@3
+ value: 0.8085714285714286
+ name: Cosine Recall@3
+ - type: cosine_recall@5
+ value: 0.8514285714285714
+ name: Cosine Recall@5
+ - type: cosine_recall@10
+ value: 0.8985714285714286
+ name: Cosine Recall@10
+ - type: cosine_ndcg@10
+ value: 0.7841742147445607
+ name: Cosine Ndcg@10
+ - type: cosine_mrr@10
+ value: 0.7477647392290245
+ name: Cosine Mrr@10
+ - type: cosine_map@100
+ value: 0.7519306451620806
+ name: Cosine Map@100
+ ---
+
+ # nomic 1.5 base Financial Matryoshka
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) <!-- at revision ac6fcd72429d86ff25c17895e47a9bfcfc50c1b2 -->
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Training Dataset:**
+     - json
+ - **Language:** en
+ - **License:** apache-2.0
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ )
+ ```
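+
+ The Pooling module applies mean pooling to the token embeddings produced by the Transformer module (see `1_Pooling/config.json`). A minimal sketch of that computation in plain PyTorch, with illustrative tensor shapes:
+
+ ```python
+ import torch
+
+ def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+     """Masked mean over the sequence dimension, matching the pooling configuration above."""
+     # token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len)
+     mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq_len, 1)
+     summed = (token_embeddings * mask).sum(dim=1)                   # sum of non-padding token embeddings
+     counts = mask.sum(dim=1).clamp(min=1e-9)                        # number of non-padding tokens
+     return summed / counts                                          # (batch, 768) sentence embeddings
+ ```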
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("CatSchroedinger/nomic-v1.5-financial-matryoshka")
+ # Run inference
+ sentences = [
+     'What is the maximum duration for patent term restoration for pharmaceutical products in the U.S.?',
+     "Patent term restoration for a single patent for a pharmaceutical product is provided to U.S. patent holders to compensate for a portion of the time invested in clinical trials and the U.S. Food and Drug Administration (FDA). There is a five-year cap on any restoration, and no patent's expiration date may be extended beyond 14 years from FDA approval.",
+     'Using AI technologies, our Tax Advisor offering leverages information generated from our ProConnect Tax Online and Lacerte offerings to enable year-round tax planning services and communicate tax savings strategies to clients.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
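+
+ Because the model was trained with Matryoshka embeddings, the same checkpoint can also be loaded with a smaller output dimensionality (768, 512, 256, 128 or 64). A minimal sketch using `truncate_dim` (the NomicBert backbone ships custom modeling code, so `trust_remote_code=True` may be required):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Keep only the first 256 dimensions of every embedding
+ model = SentenceTransformer(
+     "CatSchroedinger/nomic-v1.5-financial-matryoshka",
+     truncate_dim=256,
+     trust_remote_code=True,
+ )
+
+ embeddings = model.encode([
+     "What is the total cash flow from operating activities for Airbnb, Inc. in 2023?",
+     "Net cash provided by operating activities | 2,313 | | 3,430 | | 3,884",
+ ])
+ print(embeddings.shape)
+ # (2, 256)
+ ```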
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Information Retrieval
+
+ * Datasets: `dim_768`, `dim_512`, `dim_256`, `dim_128` and `dim_64`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric | dim_768 | dim_512 | dim_256 | dim_128 | dim_64 |
+ |:--------------------|:-----------|:---------|:-----------|:-----------|:-----------|
+ | cosine_accuracy@1 | 0.7286 | 0.7143 | 0.7014 | 0.7071 | 0.67 |
+ | cosine_accuracy@3 | 0.8514 | 0.8414 | 0.8386 | 0.8343 | 0.8086 |
+ | cosine_accuracy@5 | 0.8886 | 0.8857 | 0.88 | 0.8743 | 0.8514 |
+ | cosine_accuracy@10 | 0.92 | 0.9214 | 0.9243 | 0.9243 | 0.8986 |
+ | cosine_precision@1 | 0.7286 | 0.7143 | 0.7014 | 0.7071 | 0.67 |
+ | cosine_precision@3 | 0.2838 | 0.2805 | 0.2795 | 0.2781 | 0.2695 |
+ | cosine_precision@5 | 0.1777 | 0.1771 | 0.176 | 0.1749 | 0.1703 |
+ | cosine_precision@10 | 0.092 | 0.0921 | 0.0924 | 0.0924 | 0.0899 |
+ | cosine_recall@1 | 0.7286 | 0.7143 | 0.7014 | 0.7071 | 0.67 |
+ | cosine_recall@3 | 0.8514 | 0.8414 | 0.8386 | 0.8343 | 0.8086 |
+ | cosine_recall@5 | 0.8886 | 0.8857 | 0.88 | 0.8743 | 0.8514 |
+ | cosine_recall@10 | 0.92 | 0.9214 | 0.9243 | 0.9243 | 0.8986 |
+ | **cosine_ndcg@10** | **0.8269** | **0.82** | **0.8144** | **0.8165** | **0.7842** |
+ | cosine_mrr@10 | 0.7968 | 0.7873 | 0.7791 | 0.7822 | 0.7478 |
+ | cosine_map@100 | 0.8006 | 0.791 | 0.7827 | 0.7856 | 0.7519 |
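+
+ The per-dimension scores above come from the <code>InformationRetrievalEvaluator</code> referenced in the list before the table. A minimal sketch of rerunning such an evaluation; the queries, corpus and relevance mapping below are placeholders you would build from your own evaluation split:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.evaluation import InformationRetrievalEvaluator
+
+ model = SentenceTransformer("CatSchroedinger/nomic-v1.5-financial-matryoshka", trust_remote_code=True)
+
+ # Placeholder evaluation data: ids -> texts, and query id -> ids of relevant documents
+ queries = {"q1": "What is the maximum duration for patent term restoration in the U.S.?"}
+ corpus = {"d1": "There is a five-year cap on any restoration, and no patent's expiration date may be extended beyond 14 years from FDA approval."}
+ relevant_docs = {"q1": {"d1"}}
+
+ evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dim_768")
+ results = evaluator(model)
+ print(results["dim_768_cosine_ndcg@10"])
+ ```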
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### json
+
+ * Dataset: json
+ * Size: 6,300 training samples
+ * Columns: <code>anchor</code> and <code>positive</code>
+ * Approximate statistics based on the first 1000 samples:
+ | | anchor | positive |
+ |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
+ | type | string | string |
+ | details | <ul><li>min: 9 tokens</li><li>mean: 20.49 tokens</li><li>max: 51 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 45.72 tokens</li><li>max: 687 tokens</li></ul> |
+ * Samples:
+ | anchor | positive |
+ |:-----------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | <code>What limitations are associated with using non-GAAP financial measures such as contribution margin and adjusted income from operations?</code> | <code>Further, these metrics have certain limitations, as they do not include the impact of certain expenses that are reflected in our consolidated statements of operations.</code> |
+ | <code>What type of firm is PricewaterhouseCoopers LLP as mentioned in the financial statements?</code> | <code>PricewaterhouseCoopers LLP, mentioned as the independent registered public accounting firm with PCAOB ID 238, prepared the report on the consolidated financial statements.</code> |
+ | <code>What pages contain the financial Statements and Supplementary Data in IBM's 2023 Annual Report?</code> | <code>The Financial Statements and Supplementary Data for IBM's 2023 Annual Report are found on pages 44 through 121.</code> |
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+ ```json
+ {
+     "loss": "MultipleNegativesRankingLoss",
+     "matryoshka_dims": [
+         768,
+         512,
+         256,
+         128,
+         64
+     ],
+     "matryoshka_weights": [
+         1,
+         1,
+         1,
+         1,
+         1
+     ],
+     "n_dims_per_step": -1
+ }
+ ```
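+
+ In code, that configuration corresponds to wrapping `MultipleNegativesRankingLoss` (in-batch negatives) in a `MatryoshkaLoss` over the five output dimensions; a minimal sketch:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
+
+ model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
+
+ # The ranking loss is evaluated at every truncated dimension listed above
+ base_loss = MultipleNegativesRankingLoss(model)
+ loss = MatryoshkaLoss(model, loss=base_loss, matryoshka_dims=[768, 512, 256, 128, 64])
+ ```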
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: epoch
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 16
+ - `gradient_accumulation_steps`: 16
+ - `learning_rate`: 2e-05
+ - `num_train_epochs`: 4
+ - `lr_scheduler_type`: cosine
+ - `warmup_ratio`: 0.1
+ - `bf16`: True
+ - `tf32`: False
+ - `load_best_model_at_end`: True
+ - `optim`: adamw_torch_fused
+ - `batch_sampler`: no_duplicates
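+
+ These values map onto `SentenceTransformerTrainingArguments`; a minimal sketch (the output directory is a placeholder, and `save_strategy="epoch"` is assumed so that it matches `eval_strategy` when `load_best_model_at_end` is enabled):
+
+ ```python
+ from sentence_transformers import SentenceTransformerTrainingArguments
+ from sentence_transformers.training_args import BatchSamplers
+
+ args = SentenceTransformerTrainingArguments(
+     output_dir="output/nomic-v1.5-financial-matryoshka",  # placeholder path
+     eval_strategy="epoch",
+     save_strategy="epoch",  # assumed; must match eval_strategy for load_best_model_at_end
+     per_device_train_batch_size=32,
+     per_device_eval_batch_size=16,
+     gradient_accumulation_steps=16,
+     learning_rate=2e-5,
+     num_train_epochs=4,
+     lr_scheduler_type="cosine",
+     warmup_ratio=0.1,
+     bf16=True,
+     tf32=False,
+     load_best_model_at_end=True,
+     optim="adamw_torch_fused",
+     batch_sampler=BatchSamplers.NO_DUPLICATES,
+ )
+ ```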
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: epoch
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 16
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 2e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 4
+ - `max_steps`: -1
+ - `lr_scheduler_type`: cosine
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: True
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: False
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: True
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch_fused
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: no_duplicates
+ - `multi_dataset_batch_sampler`: proportional
+
+ </details>
+
+ ### Training Logs
+ | Epoch | Step | Training Loss | dim_768_cosine_ndcg@10 | dim_512_cosine_ndcg@10 | dim_256_cosine_ndcg@10 | dim_128_cosine_ndcg@10 | dim_64_cosine_ndcg@10 |
+ |:-------:|:------:|:-------------:|:----------------------:|:----------------------:|:----------------------:|:----------------------:|:---------------------:|
+ | 0.8122 | 10 | 11.5729 | - | - | - | - | - |
+ | 1.0 | 13 | - | 0.7995 | 0.7976 | 0.7929 | 0.7889 | 0.7646 |
+ | 1.5685 | 20 | 3.4999 | - | - | - | - | - |
+ | 2.0 | 26 | - | 0.8207 | 0.8189 | 0.8099 | 0.8090 | 0.7825 |
+ | 2.3249 | 30 | 2.8578 | - | - | - | - | - |
+ | **3.0** | **39** | **-** | **0.8267** | **0.8218** | **0.8151** | **0.8168** | **0.7826** |
+ | 3.0812 | 40 | 2.0904 | - | - | - | - | - |
+ | 3.7310 | 48 | - | 0.8269 | 0.8200 | 0.8144 | 0.8165 | 0.7842 |
+
+ * The bold row denotes the saved checkpoint.
+
+ ### Framework Versions
+ - Python: 3.11.5
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.48.2
+ - PyTorch: 2.6.0+cu124
+ - Accelerate: 1.3.0
+ - Datasets: 2.19.1
+ - Tokenizers: 0.21.0
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "_name_or_path": "nomic-ai/nomic-embed-text-v1.5",
+   "activation_function": "swiglu",
+   "architectures": [
+     "NomicBertModel"
+   ],
+   "attn_pdrop": 0.0,
+   "auto_map": {
+     "AutoConfig": "nomic-ai/nomic-bert-2048--configuration_hf_nomic_bert.NomicBertConfig",
+     "AutoModel": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertModel",
+     "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining"
+   },
+   "bos_token_id": null,
+   "causal": false,
+   "dense_seq_output": true,
+   "embd_pdrop": 0.0,
+   "eos_token_id": null,
+   "fused_bias_fc": true,
+   "fused_dropout_add_ln": true,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-12,
+   "max_trained_positions": 2048,
+   "mlp_fc1_bias": false,
+   "mlp_fc2_bias": false,
+   "model_type": "nomic_bert",
+   "n_embd": 768,
+   "n_head": 12,
+   "n_inner": 3072,
+   "n_layer": 12,
+   "n_positions": 8192,
+   "pad_vocab_size_multiple": 64,
+   "parallel_block": false,
+   "parallel_block_tied_norm": false,
+   "prenorm": false,
+   "qkv_proj_bias": false,
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.0,
+   "rotary_emb_base": 1000,
+   "rotary_emb_fraction": 1.0,
+   "rotary_emb_interleaved": false,
+   "rotary_emb_scale_base": null,
+   "rotary_scaling_factor": null,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.0,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "use_flash_attn": true,
+   "use_rms_norm": false,
+   "use_xentropy": true,
+   "vocab_size": 30528
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.48.2",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:60eaf75516b03cdf9ef97bf657225afe452d61200008b00e6293171db4af4aa1
+ size 546938168
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 8192,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 8192,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff