---
tags:
  - visual-document-retrieval
  - transformers
  - safetensors
  - colpali
  - colqwen
  - feature-extraction
  - text
  - image
  - multimodal-embedding
  - vidore
  - mixture-of-experts
  - late-interaction
  - query-conditioned-routing
  - custom_code
license: apache-2.0
base_model: Qwen/Qwen3.5-VL-2B-Instruct
library_name: transformers
language:
  - en
pipeline_tag: feature-extraction
datasets:
  - vidore/colpali_train_set
  - llamaindex/vdr-multilingual-train
---

# Argus-Colqwen3.5-2b-v0  ·  bf16 release

> **Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval**
> University of Innsbruck — Data Science group · 2026

`DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16` is the **bfloat16 merged release** of Argus-Colqwen3.5-2b. It is the *exact same trained network* as the fp32 sibling [`DataScience-UIBK/Argus-Colqwen3.5-2b-v0`](https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-2b-v0); only the LoRA-merge dtype differs.

The bf16 release is **half the disk size** (4.6 GB vs 9.3 GB), faster to download, and easier to deploy on memory-constrained GPUs.

| Release | Disk | V1 avg nDCG@5 | V2 avg nDCG@5 |
|---|---:|---:|---:|
| fp32 sibling (`-v0`) | 9.3 GB | 0.9149 | 0.6152 |
| **this release (`-v0-bf16`)** | **4.6 GB** | **0.9149** | **0.6152** |
| Δ vs fp32 | −4.7 GB | 0.0000 | 0.0000 |

At the 2B scale the merge-time bf16 cast has **no measurable effect on nDCG@5**: every per-task score matches the fp32 sibling to four decimal places (see the per-task tables below). For most deployments at this size the bf16 release is therefore the better choice.

**Use this bf16 release** unless you specifically need fp32 for downstream merging / quantisation chains.
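
The disk figures follow from the parameter count and the bytes per weight. A quick check, assuming the 2.32 B parameter total listed under model details and ignoring the small overhead of file metadata:

```python
# Rough checkpoint size = parameter count x bytes per weight.
PARAMS = 2.32e9                              # total parameters, from this card
BYTES_PER_WEIGHT = {"fp32": 4, "bf16": 2}

size_gb = {dtype: PARAMS * nbytes / 1e9 for dtype, nbytes in BYTES_PER_WEIGHT.items()}

for dtype, gb in size_gb.items():
    print(f"{dtype}: {gb:.1f} GB")
```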

## TL;DR — leaderboard standing

- **Strong on the ViDoRe v1 leaderboard at the 2B scale** (V1 = 0.9149): competitive with nomic-ai/colnomic-embed-multimodal-3b (V1 = 0.916) at **about three quarters of the parameter count** (2.3 B vs 3.0 B).
- **Best 2B-class result on V2** (V2 = 0.6152): within 0.001 of the 3B nomic-ai/colnomic-embed-multimodal-3b (0.616) and comfortably ahead of vidore/colpali-v1.3 and Metric-AI/colqwen2.5-3b-multilingual (0.580).
- 2.3 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page; fits on a single **8 GB GPU at bf16 inference**.
- **Apache 2.0**; trained only on public ViDoRe and VDR-Multilingual subsets.

## What is novel here

Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. **Argus** replaces that dense head with a sparse mixture in which the gates depend on **both** the visual token and a pooled query summary, so the *same page* gets routed differently for different queries:

1. **Region pooling** — visual tokens are grouped into 4-token regions before routing.
2. **Query-conditioned latent gating (`GateScalars`)** — router input is `region + region_coord_proj(coords) + query_context_proj(pooled_query)`. The query summary makes routing *task-aware*: a financial-numbers query routes through a different expert than a layout query, on the same page.
3. **Sparse top-k=2 of 4 latent specialists**, fused with an always-on shared dense expert via two learnable gating scalars.
4. **Region-aware load balancing** — load balance + KL-uniform + 0.01·router-z² aux losses suppress routing collapse.
5. **3-stage curriculum** — dense baseline (teacher) → MoE balance warmup → joint retrieval with KL distillation.

Router sits at backbone layer −5.
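
The routing idea in items 1 to 3 can be sketched in plain Python. This is an illustrative toy, not the released implementation: the helper names (`mean_pool`, `route`, `fuse`), the identity router weights, the softmax over only the selected experts, and the scalars `g_shared` / `g_sparse` are all assumptions made for the example. It shows the one property the design relies on: the same region is sent to different experts when the query summary changes.

```python
import math

NUM_EXPERTS, TOP_K = 4, 2

def mean_pool(tokens):
    """Average a group of token vectors into a single region vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def route(region, coord_feat, query_feat, router_weights):
    """Return [(expert_index, gate), ...] for the top-k latent experts."""
    # Router input = region + projected coordinates + projected pooled query.
    router_in = [r + c + q for r, c, q in zip(region, coord_feat, query_feat)]
    logits = [sum(w * x for w, x in zip(row, router_in)) for row in router_weights]
    top = sorted(range(NUM_EXPERTS), key=lambda e: logits[e], reverse=True)[:TOP_K]
    gates = softmax([logits[e] for e in top])  # renormalise over the chosen experts
    return list(zip(top, gates))

def fuse(shared_out, expert_outs, routing, g_shared, g_sparse):
    """Blend the always-on shared expert with the sparse mixture."""
    dim = len(shared_out)
    sparse = [sum(g * expert_outs[e][i] for e, g in routing) for i in range(dim)]
    return [g_shared * s + g_sparse * m for s, m in zip(shared_out, sparse)]

# Toy setup: one region of 4 tokens, hidden size 4, identity router weights.
tokens = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [0.0] * 4, [0.0] * 4]
region = mean_pool(tokens)                   # [0.5, 0.25, 0.0, 0.0]
coords = [0.0] * 4                           # stands in for region_coord_proj(coords)
router_weights = [[float(i == j) for j in range(4)] for i in range(NUM_EXPERTS)]

query_a = [0.0, 0.0, 2.0, 0.0]               # stands in for query_context_proj(pooled_query)
query_b = [0.0, 0.0, 0.0, 2.0]

routing_a = route(region, coords, query_a, router_weights)
routing_b = route(region, coords, query_b, router_weights)
print([e for e, _ in routing_a], [e for e, _ in routing_b])  # [2, 0] [3, 0]

expert_outs = [[float(e)] * 4 for e in range(NUM_EXPERTS)]   # dummy expert outputs
fused = fuse([1.0] * 4, expert_outs, routing_a, g_shared=0.5, g_sparse=0.5)
```

In a real model each of these steps would be a batched tensor operation and the router weights would be learned; the auxiliary losses in item 4 exist to keep the learned router from collapsing onto one expert.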

## Model details

| Property | Value |
|---|---|
| Base model | [`Qwen/Qwen3.5-VL-2B-Instruct`](https://huggingface.co/Qwen/Qwen3.5-VL-2B-Instruct) |
| Total parameters | 2.32 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size | 4 |
| Router placement | backbone layer −5 |
| Weight precision (this release) | bfloat16 |
| Adapted from | `DataScience-UIBK/Argus-Colqwen3.5-2b-v0` (fp32 merge) |
| License | Apache 2.0 |
| Model size on disk | ~4.6 GB |
| VRAM @ bf16 inference | ~5.5 GB |
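
For readers new to late interaction, the MaxSim similarity listed in the table can be written in a few lines. This is a plain-Python sketch on hypothetical 2-d toy embeddings, not the batched implementation the model ships; it only illustrates the scoring rule.

```python
def maxsim(query_emb, page_emb):
    """Late-interaction score: each query token picks its best-matching
    page token (max dot product); the picks are summed over query tokens."""
    return sum(
        max(sum(q * p for q, p in zip(q_tok, p_tok)) for p_tok in page_emb)
        for q_tok in query_emb
    )

# Hypothetical toy embeddings: 2 query tokens, pages with 3 and 1 tokens.
query  = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.0, 0.0]]
page_b = [[0.5, 0.5]]

score_a = maxsim(query, page_a)   # 0.9 + 0.8
score_b = maxsim(query, page_b)   # 0.5 + 0.5
print(score_a > score_b)          # True
```

Because every query token is matched independently, a page scores well only if it contains a good match for each part of the query.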

## Why two dtypes (and why bf16 is essentially free at 2B)

Merging a LoRA into the base requires materialising `(α/r)·A·B` and adding it to the base weight matrix.

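- In **bf16**, both the cast of the delta and the addition are rounded to an 8-bit significand, so small updates can be partly or wholly lost. The gap is small but irreversible.
- In **fp32**, the same operations are rounded to a 24-bit significand, so the merge error is negligible by comparison (though not strictly zero).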

For the **4B sibling**, this merge cost shows up as ~0.16 pp on V1 (0.9230 vs 0.9214) and ~0.27 pp on V2 (0.6407 vs 0.6380). For the **2B** model it does not: every per-task ViDoRe score matches the fp32 sibling to four decimal places. We tentatively attribute this to the smaller 2B model having fewer "borderline" routing decisions that the bf16 cast can flip; MaxSim scoring also sums over many tokens, which averages out residual noise.

So at the 2B scale **the bf16 release is the strict winner**: same scores, half the disk, faster I/O. We still publish fp32 for users who specifically need maximum-precision weights for downstream merging or quantisation.
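
To make the rounding effect concrete, here is a scalar sketch that emulates bfloat16 with the standard library. The helper `to_bf16` and the numbers `base = 0.5`, `delta = 0.001` are hypothetical and chosen only to show the mechanism: near 0.5, neighbouring bf16 values are 2⁻⁸ ≈ 0.0039 apart, so an update of 0.001 disappears when the sum is rounded, whereas fp32 keeps it.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                  # bias for round-to-nearest-even
    bits &= 0xFFFF0000                                   # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

base, delta = 0.5, 0.001   # hypothetical base weight and scaled LoRA update

merged_fp32 = to_fp32(to_fp32(base) + to_fp32(delta))
merged_bf16 = to_bf16(to_bf16(base) + to_bf16(delta))

print(merged_fp32)         # close to 0.501
print(merged_bf16)         # 0.5, the update was rounded away
```

Upcasting `merged_bf16` back to fp32 still gives 0.5, which is why the merge-time cast cannot be undone after the fact.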

## Performance — ViDoRe v1 (English, nDCG@5, 10 tasks)

Per-task scores measured with `mteb 2.12` on the published bf16 weights, side-by-side with the fp32 sibling and the 4B variants for transparency.

| Task | 2B fp32 | **2B bf16 (this)** | 4B fp32 | 4B bf16 |
|---|---:|---:|---:|---:|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 |
| InfoVQA | 0.9497 | **0.9497** | 0.9463 | 0.9447 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 |
| **Average** | 0.9149 | **0.9149** | 0.9230 | 0.9214 |
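
The averages and the fp32-to-bf16 gap can be recomputed from the per-task rows. The snippet below simply copies the numbers from the table above; nothing is re-measured.

```python
# Per-task ViDoRe v1 nDCG@5, copied from the table above (same row order).
scores = {
    "2B fp32": [0.9027, 0.6747, 0.9497, 0.9133, 0.9963, 0.9726, 0.9729, 0.9926, 0.9336, 0.8403],
    "2B bf16": [0.9027, 0.6747, 0.9497, 0.9133, 0.9963, 0.9726, 0.9729, 0.9926, 0.9336, 0.8403],
    "4B fp32": [0.9095, 0.6770, 0.9463, 0.9470, 0.9963, 0.9789, 0.9779, 0.9963, 0.9533, 0.8480],
    "4B bf16": [0.9126, 0.6779, 0.9447, 0.9346, 0.9926, 0.9750, 0.9779, 0.9963, 0.9544, 0.8485],
}

avg = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Merge cost in percentage points (1 pp = 0.01 nDCG@5).
gap_2b_pp = (avg["2B fp32"] - avg["2B bf16"]) * 100
gap_4b_pp = (avg["4B fp32"] - avg["4B bf16"]) * 100

for name, value in avg.items():
    print(f"{name}: {value:.4f}")
print(f"2B gap: {gap_2b_pp:.2f} pp, 4B gap: {gap_4b_pp:.2f} pp")
```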

### ViDoRe v1 — 2B / 3B-class leaderboard comparison

| Rank | Model | Params | dim | V1 avg |
|---:|---|---:|---:|---:|
| 1 | nomic-ai/colnomic-embed-multimodal-3b | 3.0 B | 128 | 0.916 |
| **2** | **Argus-Colqwen3.5-2b-v0-bf16 (this)** | **2.3 B** | **1024** | **0.9149** |
| 3 | Metric-AI/colqwen2.5-3b-multilingual | 3.1 B | 128 | 0.892 |
| 4 | vidore/colpali-v1.3 | 2.9 B | 128 | 0.844 |

Argus comes within 0.001 nDCG@5 of nomic's 3B-class result with fewer parameters (2.3 B vs 3.0 B) and a wider per-token dim (1024 vs 128), and is the strongest sub-3B retriever in this comparison.

## Performance — ViDoRe v2 (English, nDCG@5, 4 tasks)

| Task | 2B fp32 | **2B bf16 (this)** | 4B fp32 | 4B bf16 |
|---|---:|---:|---:|---:|
| BioMedicalLectures | 0.6499 | **0.6499** | 0.6438 | 0.6349 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 |
| **Average** | 0.6152 | **0.6152** | 0.6407 | 0.6380 |

### ViDoRe v2 — 2B / 3B-class context

| Model | V2 avg |
|---|---:|
| **Argus-Colqwen3.5-2b-v0-bf16** | **0.6152** |
| nomic-ai/colnomic-embed-multimodal-3b | 0.616 |
| Metric-AI/colqwen2.5-3b-multilingual | 0.580 |

## ViDoRe v3

Not yet evaluated for this release.

## Storage cost

| Model | Tokens/page | Dim | Bytes/page (bf16 embeddings) |
|---|---:|---:|---:|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| **Argus-Colqwen3.5-2b-v0-bf16** | **2048** | **1024** | **4.2 MB** |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |

Per-page storage is identical between the 2B and 4B Argus releases — both use the same 1024-d head and same 2048-token visual budget. The choice between them is **inference cost** (2B is ~50 % faster than 4B), not corpus storage.
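
The bytes-per-page column is tokens × dim × 2 bytes (bf16). A small sketch that reproduces the table and scales it to a corpus; the one-million-page corpus is a hypothetical example, and since 2048 is the maximum token budget the Argus figures are upper bounds.

```python
def page_bytes(tokens_per_page, dim, bytes_per_value=2):
    """Embedding storage for one page; bf16 uses 2 bytes per value."""
    return tokens_per_page * dim * bytes_per_value

# (tokens/page, dim) copied from the table above.
models = {
    "Ops-Colqwen3-4B": (1280, 2560),
    "Argus-Colqwen3.5-4b-v0-bf16": (2048, 1024),
    "Argus-Colqwen3.5-2b-v0-bf16": (2048, 1024),
    "TomoroAI/tomoro-colqwen3-embed-4b": (1280, 320),
}

for name, (tokens, dim) in models.items():
    print(f"{name}: {page_bytes(tokens, dim) / 1e6:.1f} MB/page")

# Hypothetical corpus of one million pages at the full 2048-token budget.
corpus_tb = page_bytes(2048, 1024) * 1_000_000 / 1e12
print(f"1M pages: {corpus_tb:.2f} TB")
```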

## Installation

```bash
pip install -U "transformers>=5.0.0,<6.0.0"
pip install "mteb>=2.12,<3.0.0"                        # used for the ViDoRe reproduction below
pip install flash-attn==2.6.3 --no-build-isolation     # optional
# Clear cached trust_remote_code modules so the latest model code is fetched
rm -rf ~/.cache/huggingface/modules/transformers_modules
```

## Usage

```python
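import importlib.util

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"

# flash-attn is optional: fall back to PyTorch SDPA when it is not installed
# or when running on CPU (assumes the remote model code supports "sdpa").
ATTN = (
    "flash_attention_2"
    if DEVICE == "cuda" and importlib.util.find_spec("flash_attn") is not None
    else "sdpa"
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,                  # this release ships in bf16
    attn_implementation=ATTN,
    device_map=DEVICE,
).eval()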

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

q_emb  = model.encode_queries(processor, queries)
d_emb  = model.encode_images(processor, documents)
scores = processor.score(q_emb, d_emb)
print(scores)
```
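
To turn the score matrix into a ranking, sort each query's row. The helper below is a sketch that works on nested lists; it assumes `scores` has shape (queries × pages) and can be converted with `scores.tolist()`, which is not confirmed by this card. The example values are made up.

```python
def top_pages(score_rows, k=1):
    """For each query row, return the indices of the k highest-scoring pages."""
    ranked = []
    for row in score_rows:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append(order[:k])
    return ranked

# Hypothetical scores for 2 queries x 2 pages (higher is better).
example_scores = [[12.5, 9.1], [7.3, 14.0]]
best = top_pages(example_scores, k=1)
print(best)   # [[0], [1]]
```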

## Reproduce ViDoRe results with MTEB

```python
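import mteb

# Load the model by its Hub id, then collect the tasks of both ViDoRe suites.
m  = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16")
v1 = list(mteb.get_benchmark("ViDoRe(v1)").tasks)
v2 = list(mteb.get_benchmark("ViDoRe(v2)").tasks)

# Small batch size keeps the 2048-token pages within GPU memory.
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})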
```

## Training

Same recipe as the fp32 sibling; see its [card](https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-2b-v0) for full details. The only difference is the merge dtype.

## When to use 2B-bf16 vs the other variants

| Use case | Recommendation |
|---|---|
| Smallest deployable artefact at the 2B scale | **2B bf16 (this)** ← strict winner: same scores as fp32, half the disk |
| Maximum precision at 2B for downstream merging / quantisation | **2B fp32** |
| Maximum recall on document QA / leaderboard parity | **4B fp32** |
| Latency-sensitive retrieval at the 4B scale | **4B bf16** |

## Limitations

- English-dominant.
- The merge-time bf16 cast is irreversible: upcasting after load does not recover the fp32 weights. (At 2B this has no measurable effect, since ViDoRe scores match fp32 to four decimal places; at 4B it costs roughly 0.2 to 0.3 pp.)
- ViDoRe v3 not yet evaluated.

## License

Apache 2.0, inherited from `Qwen3.5-VL-2B-Instruct`.

## Citation

```bibtex
@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-2b-v0},
}
```

## Contact

- Org: [DataScience-UIBK](https://huggingface.co/DataScience-UIBK), University of Innsbruck
- Issues: open one on this repo's *Community* tab.