---
tags:
- mteb
- sentence-transformers
- transformers
- qwen
- feature-extraction
- text-classification
- text-clustering
- text-retrieval
- text-reranking
- text-pair-classification
- text-multilabel-classification
- text-bitext-mining
library_name: sentence-transformers
base_model: Qwen/Qwen3-Embedding-8B
license: apache-2.0
language:
- en
- multilingual
extra_gated_eu_disallowed: true
---

# Euler-Legal-Embedding-V1

## Short Description

Euler-Legal-Embedding-V1 is a specialized embedding model for the legal domain, fine-tuned from [Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B). It achieves strong performance on legal retrieval and reasoning tasks within the MTEB benchmark.

## Model Details

- **Base Model**: Qwen/Qwen3-Embedding-8B
- **Model Size**: ~8B parameters
- **Embedding Dimension**: 4096 (default for Qwen3-8B)
- **Max Input Tokens**: 1536
- **Pooling**: Last-token pooling (standard for Qwen embedding models)
- **Training Data**: Legal-domain dataset (`final-data-new-anonymized-grok4-filtered.jsonl`)

## Usage

### sentence-transformers support

Using this model is straightforward once [sentence-transformers](https://www.SBERT.net) is installed:

```bash
pip install -U sentence-transformers
```

You can then use the model like this:

```python
from sentence_transformers import SentenceTransformer
import torch

# Load the model.
# trust_remote_code=True is required for Qwen-based models.
model = SentenceTransformer(
    "Mira190/Euler-Legal-Embedding-V1",
    trust_remote_code=True,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2",  # Optional; requires the flash-attn package
    },
)
model.max_seq_length = 1536

sentences = [
    "The plaintiff filed a motion for summary judgment.",
    "The court granted the motion based on lack of genuine dispute of material fact.",
]

# No specific prompt is required for this version.
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,
    batch_size=16,
    show_progress_bar=True,
)
print(embeddings.shape)  # Output: (2, 4096)
```

For a retrieval-style example that ranks candidate passages against a query, see the sketch at the end of this card.

### Transformers support

You can also use the model directly with the `transformers` library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Mira190/Euler-Legal-Embedding-V1"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)


def last_token_pool(last_hidden_state, attention_mask):
    # Last-token pooling (standard for Qwen embedding models): take the
    # hidden state of each sequence's final real token (EOS), skipping padding.
    left_padding = attention_mask[:, -1].sum() == attention_mask.shape[0]
    if left_padding:
        return last_hidden_state[:, -1]
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_indices = torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device)
    return last_hidden_state[batch_indices, sequence_lengths]


sentences = ["This is a legal document.", "This is another legal document."]

# Tokenize sentences
inputs = tokenizer(
    sentences,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1536,
)

# Move inputs to the same device as the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

embeddings = last_token_pool(outputs.last_hidden_state, inputs["attention_mask"])

# Normalize embeddings
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # Output: (2, 4096)
```

## Training Details

The model was fine-tuned with LoRA (Low-Rank Adaptation) via the Swift framework. A minimal sketch of the training objective is given at the end of this card.

- **Framework**: Swift
- **Loss Function**: InfoNCE (temperature 0.03)
- **Batch Size**: 4 (per device)
- **Learning Rate**: 2e-5
- **LoRA Config**: rank 8, alpha 32, dropout 0.05

## Citation

If you find this model useful, please consider citing:

```bibtex
@misc{euler2025legal,
  title={Euler-Legal-Embedding: Advanced Legal Representation Learning},
  author={LawRank Team},
  year={2025},
  publisher={Hugging Face}
}
```
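
## Retrieval Example (sketch)

The sketch below shows one way to rank candidate passages against a query with this model. The query, passages, and scoring code are illustrative examples, not part of the model's training setup; with `normalize_embeddings=True`, the dot product of two embeddings equals their cosine similarity.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Mira190/Euler-Legal-Embedding-V1", trust_remote_code=True)
model.max_seq_length = 1536

query = "What is the standard for granting summary judgment?"
passages = [
    "Summary judgment is appropriate when there is no genuine dispute of material fact.",
    "The statute of limitations for breach of contract is four years.",
]

# Encode query and passages; normalized embeddings make dot product = cosine similarity.
query_emb = model.encode([query], normalize_embeddings=True)
passage_emb = model.encode(passages, normalize_embeddings=True)

scores = query_emb @ passage_emb.T  # shape: (1, num_passages)
print(scores)  # Higher score = more relevant passage
```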
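
## Appendix: InfoNCE Objective (sketch)

The following is a minimal PyTorch sketch of an InfoNCE loss with in-batch negatives at temperature 0.03, matching the hyperparameters listed under Training Details. It illustrates the objective only and is not the actual Swift training code; the tensor names (`query_emb`, `pos_emb`) are hypothetical.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(query_emb, pos_emb, temperature=0.03):
    # query_emb, pos_emb: L2-normalized tensors of shape (batch, dim).
    # Similarity of every query against every positive in the batch;
    # off-diagonal entries serve as in-batch negatives.
    logits = query_emb @ pos_emb.T / temperature
    # The matching positive for query i sits on the diagonal (index i).
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)


# Toy usage with random, normalized 4096-dim embeddings (the model's output size)
q = F.normalize(torch.randn(4, 4096), dim=-1)
p = F.normalize(torch.randn(4, 4096), dim=-1)
print(info_nce_loss(q, p))
```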