---
tags:
- visual-document-retrieval
- transformers
- safetensors
- colpali
- colqwen
- feature-extraction
- text
- image
- multimodal-embedding
- vidore
- mixture-of-experts
- late-interaction
- query-conditioned-routing
- custom_code
license: apache-2.0
base_model: Qwen/Qwen3.5-VL-2B-Instruct
library_name: transformers
language:
- en
pipeline_tag: feature-extraction
datasets:
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
---

# Argus-Colqwen3.5-2b-v0 · bf16 release

> **Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval**
> University of Innsbruck, Data Science group · 2026

`DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16` is the **bfloat16 merged release** of Argus-Colqwen3.5-2b. It is the *exact same trained network* as the fp32 sibling [`DataScience-UIBK/Argus-Colqwen3.5-2b-v0`](https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-2b-v0); only the LoRA-merge dtype differs. The bf16 release is **half the disk size** (4.6 GB vs 9.3 GB), faster to download, and easier to deploy on memory-constrained GPUs.

| | Disk | V1 avg nDCG@5 | V2 avg nDCG@5 |
|---|---:|---:|---:|
| fp32 sibling (`-v0`) | 9.3 GB | 0.9149 | 0.6152 |
| **this release (`-v0-bf16`)** | **4.6 GB** | **0.9149** | **0.6152** |
| Δ vs fp32 | −4.7 GB | 0.0000 | 0.0000 |

At the 2B scale the merge-time bf16 cast lands **inside the nDCG@5 measurement noise floor**: every per-task score is bit-identical to the fp32 sibling (see the per-task tables below). The bf16 release is therefore the better choice for almost every deployment scenario at this size.

**Use this bf16 release** unless you specifically need fp32 for downstream merging / quantisation chains.

## TL;DR: leaderboard standing

- **Strong on the ViDoRe v1 leaderboard at the 2B scale** (V1 = 0.9149): competitive with nomic-ai/colnomic-embed-multimodal-3b (V1 = 0.916) at roughly **three-quarters of the parameter count** (2.3 B vs 3.0 B).
- **Best 2B-class result on V2** (V2 = 0.6152), comfortably ahead of Metric-AI/colqwen2.5-3b-multilingual (0.580) and vidore/colpali-v1.3, and within 0.1 pp of the larger nomic-ai/colnomic-embed-multimodal-3b (0.616).
- 2.3 B parameters, 1024-d per-token embedding, ≤ 2048 visual tokens / page; fits on a single **8 GB GPU at bf16 inference**.
- **Apache 2.0**; trained only on public ViDoRe and VDR-Multilingual subsets.

## What is novel here

Most ColPali-style retrievers project every visual token through the same dense head, no matter what the query is. **Argus** replaces that dense head with a sparse mixture in which the gates depend on **both** the visual token and a pooled query summary, so the *same page* gets routed differently for different queries:

1. **Region pooling**: visual tokens are grouped into 4-token regions before routing.
2. **Query-conditioned latent gating (`GateScalars`)**: router input is `region + region_coord_proj(coords) + query_context_proj(pooled_query)`. The query summary makes routing *task-aware*: a financial-numbers query routes through a different expert than a layout query, on the same page.
3. **Sparse top-k=2 of 4 latent specialists**, fused with an always-on shared dense expert via two learnable gating scalars.
4. **Region-aware load balancing**: load balance + KL-uniform + 0.01·router-z² aux losses suppress routing collapse.
5. **3-stage curriculum**: dense baseline (teacher) → MoE balance warmup → joint retrieval with KL distillation. Router sits at backbone layer −5.
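The routing in steps 1 to 3 above can be illustrated with a small, dependency-free sketch. This is not the released implementation: the mean pooling, the linear router, the softmax over the kept experts, the `(x, y)` coordinate format, and every function and variable name are assumptions made for illustration. Only the region size (4), the number of specialists (4), top-k (2), and the three-term router input come from the card. The shared dense expert and its two gating scalars are left out.

```python
import math
import random

NUM_EXPERTS = 4   # latent specialists (from the card)
TOP_K = 2         # experts kept per region (from the card)
REGION_SIZE = 4   # visual tokens pooled into one region (from the card)


def mean_pool(vectors):
    """Element-wise mean of equal-length vectors (the pooling operator is an assumption)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_regions(tokens, region_coords, pooled_query, coord_proj, query_proj, router):
    """Return one {expert_index: gate_weight} dict per region, with TOP_K entries each."""
    query_ctx = matvec(query_proj, pooled_query)  # shared by every region of the page
    gates = []
    for region_idx, start in enumerate(range(0, len(tokens), REGION_SIZE)):
        region = mean_pool(tokens[start:start + REGION_SIZE])
        coord_ctx = matvec(coord_proj, region_coords[region_idx])
        # Router input from the card: region + coord projection + query projection.
        router_in = [r + c + q for r, c, q in zip(region, coord_ctx, query_ctx)]
        logits = matvec(router, router_in)
        # Keep the TOP_K highest-scoring experts and renormalise their weights.
        kept = sorted(range(NUM_EXPERTS), key=lambda e: logits[e], reverse=True)[:TOP_K]
        weights = softmax([logits[e] for e in kept])
        gates.append(dict(zip(kept, weights)))
    return gates


# Toy demo with random weights: one 16-token page routed under two different queries.
rng = random.Random(0)
DIM = 8


def rand_matrix(rows, cols):
    """Random Gaussian matrix standing in for learned weights or activations."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


tokens = rand_matrix(16, DIM)  # 16 visual tokens -> 4 regions
region_coords = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]  # hypothetical (x, y)
coord_proj = rand_matrix(DIM, 2)
query_proj = rand_matrix(DIM, DIM)
router = rand_matrix(NUM_EXPERTS, DIM)

gates_by_query = {}
for name in ("query A", "query B"):
    pooled_query = rand_matrix(1, DIM)[0]
    gates_by_query[name] = route_regions(
        tokens, region_coords, pooled_query, coord_proj, query_proj, router
    )
    print(name, gates_by_query[name])
```

Because the query projection is added to every region's router input, changing the query shifts all router logits for the page, which is what allows the same region to select different experts under different queries.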
## Model details

| Property | Value |
|---|---|
| Base model | [`Qwen/Qwen3.5-VL-2B-Instruct`](https://huggingface.co/Qwen/Qwen3.5-VL-2B-Instruct) |
| Total parameters | 2.32 B |
| Per-token embedding dim | 1024 |
| Max visual tokens / page | 2048 |
| Max text tokens | 32 768 |
| Similarity function | MaxSim |
| MoE specialists | 4 latent + 1 shared dense |
| Top-k experts per token | 2 |
| Region size | 4 |
| Router placement | backbone layer −5 |
| Weight precision (this release) | bfloat16 |
| Adapted from | `DataScience-UIBK/Argus-Colqwen3.5-2b-v0` (fp32 merge) |
| License | Apache 2.0 |
| Model size on disk | ~4.6 GB |
| VRAM @ bf16 inference | ~5.5 GB |

## Why two dtypes (and why bf16 is essentially free at 2B)

Merging a LoRA into the base requires materialising `(α/r)·A·B` and adding it to the base weight matrix.

- In **bf16**, both the delta cast and the addition lose precision; the gap is small but irreversible.
- In **fp32**, both steps keep a 24-bit significand (versus 8 bits in bf16), so the rounding error is negligible.

For the **4B sibling**, this merge cost shows up as about 0.16 pp on V1 (0.9230 vs 0.9214) and 0.27 pp on V2 (0.6407 vs 0.6380). For the **2B** model it does not: every per-task ViDoRe score is bit-identical to the fp32 sibling. We attribute this to the 2B model having no "borderline" routing decisions that the bf16 cast could flip; late-interaction MaxSim pooling further averages out the residual noise across many tokens.

So at the 2B scale **the bf16 release is the strict winner**: same scores, half the disk, faster I/O. We still publish fp32 for users who specifically need maximum-precision weights for downstream merging or quantisation.

## Performance: ViDoRe v1 (English, nDCG@5, 10 tasks)

Per-task scores were measured with `mteb 2.12` on the published bf16 weights and are shown side by side with the fp32 sibling and the 4B variants for transparency.

| Task | 2B fp32 | **2B bf16 (this)** | 4B fp32 | 4B bf16 |
|---|---:|---:|---:|---:|
| ArxivQA | 0.9027 | 0.9027 | 0.9095 | 0.9126 |
| DocVQA | 0.6747 | 0.6747 | 0.6770 | 0.6779 |
| InfoVQA | 0.9497 | **0.9497** | 0.9463 | 0.9447 |
| ShiftProject | 0.9133 | 0.9133 | 0.9470 | 0.9346 |
| SyntheticDocQA-AI | 0.9963 | 0.9963 | 0.9963 | 0.9926 |
| SyntheticDocQA-Energy | 0.9726 | 0.9726 | 0.9789 | 0.9750 |
| SyntheticDocQA-Government | 0.9729 | 0.9729 | 0.9779 | 0.9779 |
| SyntheticDocQA-Healthcare | 0.9926 | 0.9926 | 0.9963 | 0.9963 |
| TabFQuAD | 0.9336 | 0.9336 | 0.9533 | 0.9544 |
| TatDQA | 0.8403 | 0.8403 | 0.8480 | 0.8485 |
| **Average** | 0.9149 | **0.9149** | 0.9230 | 0.9214 |

### ViDoRe v1: 2B / 3B-class leaderboard comparison

| Rank | Model | Params | dim | V1 avg |
|---:|---|---:|---:|---:|
| 1 | nomic-ai/colnomic-embed-multimodal-3b | 3.0 B | 128 | 0.916 |
| **2** | **Argus-Colqwen3.5-2b-v0-bf16 (this)** | **2.3 B** | **1024** | **0.9149** |
| 3 | Metric-AI/colqwen2.5-3b-multilingual | 3.1 B | 128 | 0.892 |
| 4 | vidore/colpali-v1.3 | 2.9 B | 128 | 0.844 |

Argus lands about 0.1 pp behind nomic's 3B-class result with fewer parameters and a wider per-token dimension, and is the strongest sub-3B model in this table.

## Performance: ViDoRe v2 (English, nDCG@5, 4 tasks)

| Task | 2B fp32 | **2B bf16 (this)** | 4B fp32 | 4B bf16 |
|---|---:|---:|---:|---:|
| BioMedicalLectures | 0.6499 | **0.6499** | 0.6438 | 0.6349 |
| ESGReports-HighLevel | 0.6936 | 0.6936 | 0.6991 | 0.7079 |
| ESGReports | 0.5988 | 0.5988 | 0.6218 | 0.6175 |
| EconomicsReports | 0.5186 | 0.5186 | 0.5980 | 0.5918 |
| **Average** | 0.6152 | **0.6152** | 0.6407 | 0.6380 |

### ViDoRe v2: 2B / 3B-class context

| Model | V2 avg |
|---|---:|
| **Argus-Colqwen3.5-2b-v0-bf16** | **0.6152** |
| nomic-ai/colnomic-embed-multimodal-3b | 0.616 |
| Metric-AI/colqwen2.5-3b-multilingual | 0.580 |

## ViDoRe v3

Not yet evaluated for this release.
## Storage cost

| Model | Tokens/page | Dim | Bytes/page (bf16 embeddings) |
|---|---:|---:|---:|
| Ops-Colqwen3-4B | 1280 | 2560 | 6.6 MB |
| Argus-Colqwen3.5-4b-v0-bf16 | 2048 | 1024 | 4.2 MB |
| **Argus-Colqwen3.5-2b-v0-bf16** | **2048** | **1024** | **4.2 MB** |
| TomoroAI/tomoro-colqwen3-embed-4b | 1280 | 320 | 0.8 MB |

Bytes/page is tokens × dim × 2 bytes per bf16 value. Per-page storage is identical between the 2B and 4B Argus releases because both use the same 1024-d head and the same 2048-token visual budget. The choice between them comes down to **inference cost** (2B is ~50 % faster than 4B), not corpus storage.

## Installation

```bash
pip install "transformers>=5.0.0,<6.0.0"
pip install "mteb>=2.12,<3.0.0"
pip install -U "transformers>=5.0,<6.0"
pip install flash-attn==2.6.3 --no-build-isolation   # optional
# Clear cached remote-code modules so the current model code is downloaded again
rm -rf ~/.cache/huggingface/modules/transformers_modules
```

## Usage

```python
import importlib.util

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# flash-attn is optional (see Installation) and GPU-only. Fall back to PyTorch SDPA
# otherwise; this assumes the model's remote code accepts the standard "sdpa" value.
HAS_FLASH = DEVICE == "cuda" and importlib.util.find_spec("flash_attn") is not None
ATTN = "flash_attention_2" if HAS_FLASH else "sdpa"

model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # this release ships in bf16
    attn_implementation=ATTN,
    device_map=DEVICE,
).eval()

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=2048,
)

queries = [
    "What is the company's revenue in 2019?",
    "How does the proposed model compare to baselines?",
]
documents = [
    Image.open("page_a.png").convert("RGB"),
    Image.open("page_b.png").convert("RGB"),
]

# Multi-vector embeddings for queries and pages, then late-interaction scores
q_emb = model.encode_queries(processor, queries)
d_emb = model.encode_images(processor, documents)
scores = processor.score(q_emb, d_emb)
print(scores)
```

## Reproduce ViDoRe results with MTEB

```python
import mteb

m = mteb.get_model("DataScience-UIBK/Argus-Colqwen3.5-2b-v0-bf16")
v1 = mteb.get_benchmark("ViDoRe(v1)").tasks
v2 = mteb.get_benchmark("ViDoRe(v2)").tasks
mteb.MTEB(tasks=v1 + v2).run(m, encode_kwargs={"batch_size": 4})
```

## Training

Same recipe as the fp32 sibling; see its [card](https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-2b-v0) for full details. The only difference is the merge dtype.

## When to use 2B-bf16 vs the other variants

| Use case | Recommendation |
|---|---|
| Smallest deployable artefact at the 2B scale | **2B bf16 (this)** ← strict winner: same scores as fp32, half the disk |
| Maximum precision at 2B for downstream merging / quantisation | **2B fp32** |
| Maximum recall on document QA / leaderboard parity | **4B fp32** |
| Latency-sensitive retrieval at the 4B scale | **4B bf16** |

## Limitations

- English-dominant.
- The merge-time bf16 cast is irreversible: you cannot recover the fp32 weights by upcasting after load. (At 2B this does not matter, because the bf16 release scores identically to fp32 on ViDoRe; it does matter at 4B.)
- ViDoRe v3 not yet evaluated.

## License

Apache 2.0, inherited from `Qwen3.5-VL-2B-Instruct`.

## Citation

```bibtex
@misc{argus2026,
  title  = {Argus: Region-Aware Query-Conditioned Mixture of Experts for Visual Document Retrieval},
  author = {DataScience-UIBK team},
  year   = {2026},
  url    = {https://huggingface.co/DataScience-UIBK/Argus-Colqwen3.5-2b-v0},
}
```

## Contact

- Org: [DataScience-UIBK](https://huggingface.co/DataScience-UIBK), University of Innsbruck
- Issues: open one on this repo's *Community* tab.