How to use llm-wizard/legal-ft with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("llm-wizard/legal-ft")
sentences = [
"What actions did Mr and Mrs Harris take that led to the revelation of the facts in the case?",
"Perplexity’s marketing activities include promoting on its Instagram account a massive billboard \nin Times Square from September 2024 which read “Congratulations Perplexity on 250 million \nquestions answered last month.”5 \n \n4 Discover New York with Perplexity, Perplexity AI (last visited Oct. 17, 2024), \nhttps://www.perplexity.ai/encyclopedia/discovernewyork. \n5 @perplexity.ai, Instagram (Sept. 4, 2024), \nhttps://www.instagram.com/perplexity.ai/p/C_g2TonSHC5. \nCase 1:24-cv-07984 Document 1 Filed 10/21/24 Page 8 of 42",
"31 \n \nstatus. It was not until Mr. and Mrs. Harris retained counsel, served a demand letter on May 22, \n2024, met with the then Assistant Superintendent and a lengthy “bulling investigation” that these \nfacts came to light. \nThe Defendant’s actions and conduct, by definition, was arbitrary and capricious as was \nthe imposition of discipline that was a gross abuse of discretion when it served as a catalyst for \nthis action. Similarly, the Defendants exceeded their authority by repeatedly doubling down on \ntheir acts and conduct when given the opportunity to reverse course. The adverse action taken was \nnot based on sound, objective, adopted and approved policies and procedures regarding the use of",
"website users, and licensing is transacted with individuals and entities residing in this State and \nDistrict. As such, the injuries alleged herein from Perplexity’s infringement and other unlawful \nconduct foreseeably occurred in this State and District. In addition, Perplexity or its agents reside \nin this District and may be found in this State and District. \n23. \nDefendant Perplexity is subject to the jurisdiction of this Court pursuant to N.Y. \nC.P.L.R. § 302(a)(1) and (3) as it has purposefully directed its activities at New York and has \nCase 1:24-cv-07984 Document 1 Filed 10/21/24 Page 7 of 42"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-l. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
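The module list above says that the sentence vector is taken from the [CLS] token (`pooling_mode_cls_token: True`) and then passed through `Normalize()`. A minimal pure-Python sketch of those two steps follows; the helper names `cls_pool` and `l2_normalize` and the toy token vectors are illustrative assumptions, not part of the library.

```python
import math

def cls_pool(token_embeddings):
    """Return a copy of the first ([CLS]) token vector of one sequence."""
    return list(token_embeddings[0])

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Toy input: 3 tokens with 4 dimensions each (the real model uses 1024).
tokens = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
]

sentence_vec = l2_normalize(cls_pool(tokens))
print(sentence_vec)
```

Because the output is unit length, the dot product of two such vectors equals their cosine similarity.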
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("llm-wizard/legal-ft")
# Run inference
sentences = [
'What challenges do professional journalists and publishers face that may impact their ability to enforce their intellectual property rights?',
'19 \nrespect. They feel very good about it. And in our user interface, even though we give the answer, \nwe do show the user exactly where the answer is coming from.”16 \n68. \nAs Srinivas surely knows or should know, academic standards for avoiding \nplagiarism are wholly independent from copyright law.17 Dow Jones and NYP Holdings editors \nand journalists are not graduate students working out of a library or lab, eager to have someone \nacknowledge and utilize their research. They are professional journalists and publishers – working \nunder high-pressure deadlines, sometimes in dangerous places – whose livelihoods depend on the \nenforcement and monetization of their intellectual property rights. \n69.',
'ban or prohibition on the use of AI by students. The Defendants were not trained on any policies \nor procedures for use of AI alone, never mind what they were “able to do” to students who used \nit. The entire purpose behind having such policies and procedures in place is to ensure notice, \nequity, fairness and to be sure: a level playing field for all. Making matters worse, there exists \nno adequate procedures and policies for the induction of an applicant into NHS when compared to \nother members who are inducted despite the same or similar infractions. This is a denial of student \nrights of the highest order. \n \nIn the case here, RNH was disciplined on an ad hoc and on-going basis over more than six',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
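Since the model was trained with a Matryoshka objective over the dimensions 768, 512, 256, 128, and 64 (see the loss parameters in this card), embeddings can in principle be shortened by keeping only their leading components and renormalizing. The sketch below illustrates that idea in pure Python on toy vectors; `truncate_and_normalize` and `cosine` are hypothetical helpers and the numbers are made up. Recent sentence-transformers releases also accept a `truncate_dim` argument when loading a model, which may be the more convenient route in practice.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dimensional stand-ins for model.encode(...) output (hypothetical values).
query = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
passage = [0.4, 0.6, 0.5, 0.5, 0.1, 0.0, 0.0, 0.1]

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    p = truncate_and_normalize(passage, dim)
    print(dim, round(cosine(q, p), 4))
```

Smaller dimensions trade some retrieval quality for lower storage and faster search; how much quality is lost for this model is not reported in the card.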
Evaluated with InformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.7292 |
| cosine_accuracy@3 | 0.8542 |
| cosine_accuracy@5 | 0.9375 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.7292 |
| cosine_precision@3 | 0.2847 |
| cosine_precision@5 | 0.1875 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.7292 |
| cosine_recall@3 | 0.8542 |
| cosine_recall@5 | 0.9375 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.8576 |
| cosine_mrr@10 | 0.8125 |
| cosine_map@100 | 0.8125 |
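To make the metric names concrete, here is a small pure-Python sketch of how accuracy@k, precision@k, recall@k, MRR@k, and NDCG@k can be computed from ranked results. It is not the evaluator's own code; the function name `ir_metrics` and the toy document IDs are assumptions for illustration. The table is consistent with a single relevant passage per query, in which case precision@k equals accuracy@k divided by k (for example 0.8542 / 3 ≈ 0.2847).

```python
import math

def ir_metrics(ranked, relevant, k):
    """Average retrieval metrics over queries.

    ranked:   {query_id: [doc_id, ...]} ordered best-first
    relevant: {query_id: set of relevant doc_ids} (non-empty per query)
    """
    n = len(ranked)
    acc = prec = rec = mrr = ndcg = 0.0
    for qid, ranking in ranked.items():
        rel = relevant[qid]
        hits = [1 if doc in rel else 0 for doc in ranking[:k]]
        acc += 1.0 if any(hits) else 0.0
        prec += sum(hits) / k
        rec += sum(hits) / len(rel)
        # Reciprocal rank of the first relevant document, 0 if none in top k.
        for rank, hit in enumerate(hits, start=1):
            if hit:
                mrr += 1.0 / rank
                break
        # Binary-relevance DCG, normalized by the best achievable DCG.
        dcg = sum(h / math.log2(r + 1) for r, h in enumerate(hits, start=1))
        idcg = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(rel), k) + 1))
        ndcg += dcg / idcg
    return {
        f"accuracy@{k}": acc / n,
        f"precision@{k}": prec / n,
        f"recall@{k}": rec / n,
        f"mrr@{k}": mrr / n,
        f"ndcg@{k}": ndcg / n,
    }

# Toy example: two queries, one relevant document each.
ranked = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d5", "d7"]}
relevant = {"q1": {"d1"}, "q2": {"d5"}}
metrics = ir_metrics(ranked, relevant, k=3)
print(metrics)
```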
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |

| sentence_0 | sentence_1 |
|---|---|
| What provisions of the 2023-2024 Handbook were referenced regarding the use of AI and academic integrity? | 13 |
| How is plagiarism defined in the context provided? | 13 |
| What is the case number associated with the document filed on 10/21/24? | program-ad-revenue-sharing-ai-time-fortune-der-spiegel. |
MatryoshkaLoss with these parameters:
{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
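The configuration above wraps MultipleNegativesRankingLoss in a Matryoshka objective: the inner loss is evaluated on embeddings truncated to each listed dimension, and the results are combined using the weights. The following pure-Python sketch shows that computation on toy vectors. It is a simplified illustration, not the library's implementation; the function names are hypothetical, and the similarity scale of 20 is an assumption (the card does not list it).

```python
import math

def cos(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch negatives: anchor i should score highest against positive i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # cross-entropy with target index i
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the inner loss over truncated embeddings."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        a = [v[:dim] for v in anchors]
        p = [v[:dim] for v in positives]
        loss += weight * mnr_loss(a, p)
    return loss

# Toy batch: 2 (question, passage) pairs with 4-dimensional embeddings.
anchors = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
positives = [[0.9, 0.1, 0.4, 0.0], [0.1, 0.9, 0.0, 0.6]]

aligned = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
shuffled = matryoshka_loss(anchors, positives[::-1], dims=[4, 2], weights=[1, 1])
print(aligned, shuffled)
```

Correctly paired batches should yield a much lower loss than mismatched ones at every truncation level, which is what pushes the leading dimensions to carry most of the useful signal.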
Non-default hyperparameters:

- eval_strategy: steps
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- num_train_epochs: 10
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin

| Epoch | Step | cosine_ndcg@10 |
|---|---|---|
| 1.0 | 40 | 0.8182 |
| 1.25 | 50 | 0.8172 |
| 2.0 | 80 | 0.8112 |
| 2.5 | 100 | 0.8414 |
| 3.0 | 120 | 0.8236 |
| 3.75 | 150 | 0.7962 |
| 4.0 | 160 | 0.7930 |
| 5.0 | 200 | 0.8536 |
| 6.0 | 240 | 0.8263 |
| 6.25 | 250 | 0.8257 |
| 7.0 | 280 | 0.8475 |
| 7.5 | 300 | 0.8505 |
| 8.0 | 320 | 0.8499 |
| 8.75 | 350 | 0.8582 |
| 9.0 | 360 | 0.8576 |
| 10.0 | 400 | 0.8576 |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
Snowflake/snowflake-arctic-embed-l