---
language:
- en
license: mit
library_name: transformers
tags:
- reranking
- information-retrieval
- pointwise
- binary-cross-entropy
- efficient
- llama
base_model: meta-llama/Llama-3.2-3B
datasets:
- Tevatron/msmarco-passage
- abdoelsayed/DeAR-COT
pipeline_tag: text-classification
---

# DeAR-3B-Reranker-CE-v1

## Model Description

**DeAR-3B-Reranker-CE-v1** is an efficient 3B-parameter neural reranker trained with binary cross-entropy (BCE) loss and knowledge distillation. It provides fast, reliable reranking for production environments where speed and cost are critical.

## Model Details

- **Model Type:** Pointwise Reranker (Binary Classification)
- **Base Model:** LLaMA-3.2-3B
- **Parameters:** 3 billion
- **Training Method:** Knowledge Distillation + Binary Cross-Entropy
- **Teacher Model:** [LLaMA2-13B-RankLLaMA](https://huggingface.co/abdoelsayed/llama2-13b-rankllama-teacher)
- **Training Data:** MS MARCO + DeAR-COT
- **Model Size:** 6GB (BF16)

## Key Features

- ✅ **Ultra Fast:** 1.5s to rerank 100 documents (fastest in the DeAR family)
- ✅ **Memory Efficient:** Runs on a single 16GB GPU
- ✅ **Production Ready:** Stable training with BCE loss
- ✅ **Cost Effective:** Lower computational costs
- ✅ **Binary Classification:** Probabilistic relevance scores

## Usage

### Quick Start

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model
model_path = "abdoelsayed/dear-3b-reranker-ce-v1"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16
)
model.eval().cuda()

# Score a query-document pair
query = "What is llama?"
document = "The llama is a domesticated South American camelid..."

inputs = tokenizer(
    f"query: {query}",
    f"document: {document}",
    return_tensors="pt",
    truncation=True,
    max_length=228,
    padding="max_length"
)
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

print(f"Relevance score: {score}")
```

### Efficient Batch Processing

```python
import torch
from typing import List, Tuple

@torch.inference_mode()
def fast_rerank(tokenizer, model, query: str,
                docs: List[Tuple[str, str]],
                batch_size: int = 128):
    """Fast reranking optimized for the 3B model.

    docs is a list of (title, passage) pairs.
    """
    device = next(model.parameters()).device
    scores = []

    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]

        # Prepare batch
        queries = [f"query: {query}"] * len(batch)
        documents = [f"document: {t} {p}" for t, p in batch]

        # Tokenize
        inputs = tokenizer(
            queries,
            documents,
            return_tensors="pt",
            truncation=True,
            max_length=228,
            padding=True
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}

        # Score
        logits = model(**inputs).logits.squeeze(-1)
        scores.extend(logits.cpu().tolist())

    # Rank by descending score
    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

# Example
query = "When did Thomas Edison invent the light bulb?"
docs = [
    ("", "Thomas Edison invented the light bulb in 1879"),
    ("", "Coffee is good for diet"),
    ("", "Lightning strike at Seoul National University"),
]

ranking = fast_rerank(tokenizer, model, query, docs, batch_size=128)
print(ranking)
# Example output:
# [(0, -6.0625), (2, -11.125), (1, -12.0625)]
```
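The scores above are raw logits, hence the negative values in the example output. Since the head was trained with binary cross-entropy on a single logit, a sigmoid maps each score to a relevance probability when a bounded value in (0, 1) is more convenient. A minimal sketch:

```python
import torch

# Sigmoid turns a BCE-trained logit into a relevance probability in (0, 1).
# The mapping is monotonic, so the ranking order is unchanged.
logit = -6.0625  # e.g., the top score from the example above
probability = torch.sigmoid(torch.tensor(logit)).item()
print(f"Relevance probability: {probability:.4f}")  # ~0.0023
```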
### Production Optimization

```python
# Optimize for maximum throughput
model = AutoModelForSequenceClassification.from_pretrained(
    "abdoelsayed/dear-3b-reranker-ce-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model.eval()

# Compile for a 20-30% speedup (PyTorch 2.0+)
if hasattr(torch, 'compile'):
    model = torch.compile(model, mode="max-autotune")

# Use larger batches for throughput
batch_size = 128  # the 3B model can handle larger batches
```

## Training Details

### Training Configuration

```json
{
  "base_model": "meta-llama/Llama-3.2-3B",
  "teacher_model": "abdoelsayed/llama2-13b-rankllama-teacher",
  "loss": "Binary Cross-Entropy",
  "distillation": {
    "temperature": 2.0,
    "alpha": 0.1
  },
  "learning_rate": 1e-4,
  "batch_size": 4,
  "gradient_accumulation": 2,
  "epochs": 2,
  "max_length": 228,
  "bf16": true
}
```
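For intuition on how the `temperature` and `alpha` settings above are typically combined, the sketch below interpolates a hard-label BCE term with a soft term that matches the teacher's temperature-scaled probabilities. This is an illustrative sketch under a common convention, not the authors' training code; `student_logits`, `teacher_logits`, and `labels` are placeholder tensors.

```python
import torch
import torch.nn.functional as F

def distill_bce_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     labels: torch.Tensor,
                     temperature: float = 2.0,
                     alpha: float = 0.1) -> torch.Tensor:
    """Sketch of a BCE + knowledge-distillation objective (not the official code)."""
    # Hard-label term: standard binary cross-entropy on gold relevance labels.
    hard = F.binary_cross_entropy_with_logits(student_logits, labels.float())
    # Soft-label term: match the teacher's temperature-softened probabilities.
    soft_targets = torch.sigmoid(teacher_logits / temperature)
    soft = F.binary_cross_entropy_with_logits(student_logits / temperature, soft_targets)
    # One common convention: alpha weights the teacher (soft) term.
    return alpha * soft + (1.0 - alpha) * hard
```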
### Hardware

- **GPUs:** 4x NVIDIA A100 (40GB)
- **Training Time:** ~17 hours
- **Memory Usage:** ~24GB per GPU
- **Trainable Parameters:** 3B (full fine-tuning)

## Evaluation Results

### TREC Deep Learning

| Dataset | NDCG@10 | NDCG@20 | MRR@10 |
|---------|---------|---------|--------|
| DL19 | 70.8 | 67.3 | 83.9 |
| DL20 | 68.9 | 65.8 | 81.7 |

### BEIR Benchmark

| Dataset | NDCG@10 |
|---------|---------|
| MS MARCO | 65.3 |
| NQ | 48.7 |
| HotpotQA | 57.9 |
| FiQA | 43.6 |
| ArguAna | 55.8 |
| SciFact | 70.2 |
| TREC-COVID | 81.8 |
| NFCorpus | 37.2 |
| **Average** | **57.6** |

### Efficiency

| Metric | 3B-CE | 8B-CE | Improvement |
|--------|-------|-------|-------------|
| Inference (100 docs) | 1.5s | 2.2s | **1.5x faster** |
| Throughput | 67 docs/s | 45 docs/s | **1.5x** |
| GPU Memory | 12GB | 18GB | **33% less** |
| Model Size | 6GB | 16GB | **62% smaller** |

## Comparison

### vs. Other 3B Models

| Model | Loss | DL19 | DL20 | Speed (s) |
|-------|------|------|------|-----------|
| **DeAR-3B-CE** | BCE | 70.8 | 68.9 | 1.5 |
| DeAR-3B-RankNet | RankNet | 71.2 | 69.4 | 1.5 |
| MonoT5-3B | - | 71.8 | 68.9 | 3.5 |

**Key Advantages:**
- 2.3x faster than MonoT5-3B
- Comparable accuracy
- More stable training (BCE vs. more complex ranking losses)

## When to Use

**Best for:**
- ✅ High-throughput production systems
- ✅ Real-time applications (latency < 2s)
- ✅ Cost-sensitive deployments
- ✅ Edge deployment (smaller GPUs)
- ✅ Binary relevance tasks

**Consider alternatives for:**
- ❌ Maximum accuracy (use the 8B models)
- ❌ Complex reasoning queries (use a listwise reranker)
- ❌ Settings where compute budget is not a constraint

## Deployment Examples

### REST API Server

```python
from typing import List

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification

app = FastAPI()

# Load model once at startup
tokenizer, model = None, None

@app.on_event("startup")
async def load_model():
    global tokenizer, model
    tokenizer = AutoTokenizer.from_pretrained("abdoelsayed/dear-3b-reranker-ce-v1")
    model = AutoModelForSequenceClassification.from_pretrained(
        "abdoelsayed/dear-3b-reranker-ce-v1",
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    model.eval()
    if hasattr(torch, 'compile'):
        model = torch.compile(model)

class RerankRequest(BaseModel):
    query: str
    documents: List[str]

@app.post("/rerank")
async def rerank(request: RerankRequest):
    # fast_rerank is defined in "Efficient Batch Processing" above;
    # plain documents arrive without titles, so pass empty-title pairs.
    ranking = fast_rerank(tokenizer, model, request.query,
                          [("", doc) for doc in request.documents])
    return {"ranking": ranking}
```

### Batch Processing Script

```python
import pandas as pd
from tqdm import tqdm

# Load queries and documents; the 'documents' column is assumed to
# hold a list of (title, text) pairs per row.
df = pd.read_csv("queries_docs.csv")

results = []
for _, row in tqdm(df.iterrows(), total=len(df)):
    ranking = fast_rerank(tokenizer, model, row['query'], row['documents'])
    results.append({
        'query_id': row['query_id'],
        'ranking': ranking
    })

# Save results
pd.DataFrame(results).to_csv("reranked.csv", index=False)
```

## Model Architecture

```
Input: "query: [Q]" + "document: [D]" (encoded as a pair)
          ↓
LLaMA-3.2-3B decoder (28 layers, hidden size 3072)
          ↓
Last-Token Pooling
          ↓
Linear(3072 → 1)
          ↓
Binary Relevance Score
```

## Limitations

1. **Accuracy:** ~3-4 NDCG@10 lower than the 8B models
2. **Complex Queries:** May miss subtle nuances
3. **Document Length:** Document text is truncated to fit the 228-token input (roughly 196 tokens after the query)
4. **Language:** English only
5. **Domain:** Optimized for web documents

## Related Models

**DeAR 3B Family:**
- [DeAR-3B-RankNet](https://huggingface.co/abdoelsayed/dear-3b-reranker-ranknet-v1) - RankNet variant (slightly better accuracy)
- [DeAR-3B-CE-LoRA](https://huggingface.co/abdoelsayed/dear-3b-reranker-ce-lora-v1) - LoRA adapter

**Larger Models:**
- [DeAR-8B-CE](https://huggingface.co/abdoelsayed/dear-8b-reranker-ce-v1) - Higher accuracy

**Resources:**
- [Teacher Model](https://huggingface.co/abdoelsayed/llama2-13b-rankllama-teacher)
- [DeAR-COT Dataset](https://huggingface.co/datasets/abdoelsayed/DeAR-COT)

## Citation

```bibtex
@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}
```

## License

MIT License

## More Information

- **GitHub:** [DataScienceUIBK/DeAR-Reranking](https://github.com/DataScienceUIBK/DeAR-Reranking)
- **Paper:** [arXiv:2508.16998](https://arxiv.org/abs/2508.16998)
- **Collection:** [DeAR Models](https://huggingface.co/collections/abdoelsayed/dear-reranking)