jina-vlm: Small Multilingual Vision Language Model
Blog | API | AWS | Azure | GCP | arXiv
jina-vlm is a 2.4B-parameter vision-language model that achieves state-of-the-art multilingual visual question answering performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Training data comprises approximately 5M multimodal samples and 12B text tokens across 29 languages, with roughly half in English and the remainder spanning high- and moderate-resource languages.
Built on Qwen3-1.7B-Base with SigLIP2-So400M, it processes images via overlapping tiling with attention-based token pooling that reduces visual tokens by 4x while preserving spatial information. The model achieves the highest average score (72.3) across eight VQA benchmarks while leading on multilingual multimodal understanding (MMMB: 78.8, Multilingual MMBench: 74.3).
| Model | Params | VQA Avg | MMMB | Multilingual MMBench | RealWorldQA |
|---|---|---|---|---|---|
| jina-vlm | 2.4B | 72.3 | 78.8 | 74.3 | 68.2 |
| Qwen2-VL-2B | 2.2B | 66.4 | 71.3 | 69.4 | 62.9 |
| Qwen3-VL-2B | 2.2B | 71.6 | 75.0 | 72.3 | 63.9 |
| InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 64.3 |
| InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 62.0 |
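The attention-pooling connector mentioned above reduces the visual token count by pooling each 2x2 window of vision-encoder patches into a single token before projecting it into the language model's embedding space. The snippet below is a minimal PyTorch sketch of that idea; the layer shapes, head count, and single learned pooling query are illustrative assumptions, not the model's exact implementation.
import torch
import torch.nn as nn

class AttentionPool2x2(nn.Module):
    # Illustrative sketch: pool each 2x2 window of patch embeddings into one visual token
    def __init__(self, vision_dim=1152, llm_dim=2048):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, vision_dim))  # learned pooling query
        self.attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)                 # map into the LLM embedding space

    def forward(self, patches, grid_h, grid_w):
        # patches: (batch, grid_h * grid_w, vision_dim) from the vision encoder
        b, _, d = patches.shape
        x = patches.view(b, grid_h // 2, 2, grid_w // 2, 2, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, 4, d)          # one row per 2x2 window
        q = self.query.expand(x.shape[0], -1, -1)
        pooled, _ = self.attn(q, x, x)                             # attend over the 4 patches
        pooled = pooled.reshape(b, (grid_h // 2) * (grid_w // 2), d)
        return self.proj(pooled)                                   # 4x fewer visual tokens

# Example: a 24x24 patch grid (576 patches) becomes 144 visual tokens
tokens = AttentionPool2x2()(torch.randn(1, 24 * 24, 1152), grid_h=24, grid_w=24)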
Via Jina API
We provide an OpenAI-compatible API at https://api-beta-vlm.jina.ai. All requests require a Jina API key in the Authorization header; you can get your API key at jina.ai.
Image from URL
| Format | Example |
|---|---|
| HTTP/HTTPS URL | https://example.com/image.jpg |
| Base64 data URI | data:image/jpeg;base64,/9j/4AAQ... |
curl https://api-beta-vlm.jina.ai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $JINA_API_KEY" \
-d '{
"model": "jina-vlm",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Describe this image"},
{"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
]
}]
}'
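Because the endpoint is OpenAI-compatible, the same request can also be issued with the official openai Python client. The sketch below reuses the base URL, model name, and message format from the curl example above; the client library is a separate dependency, not part of this repository.
import os
from openai import OpenAI

client = OpenAI(
    base_url='https://api-beta-vlm.jina.ai/v1',
    api_key=os.environ['JINA_API_KEY'],
)
response = client.chat.completions.create(
    model='jina-vlm',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe this image'},
            {'type': 'image_url', 'image_url': {'url': 'https://example.com/photo.jpg'}},
        ],
    }],
)
print(response.choices[0].message.content)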
Local image (base64)
curl https://api-beta-vlm.jina.ai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $JINA_API_KEY" \
-d '{
"model": "jina-vlm",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'$(base64 -i image.jpg)'"}}
]
}]
}'
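Note that `base64 -i image.jpg` is the macOS invocation; on Linux, `base64 -w 0 image.jpg` produces the same single-line output. Alternatively, here is a small Python sketch that builds the data URI and posts it with requests (the `to_data_uri` helper is our own, not part of the API):
import base64
import os
import requests

def to_data_uri(path, mime='image/jpeg'):
    # Encode a local image file as a base64 data URI accepted by the API
    with open(path, 'rb') as f:
        return f'data:{mime};base64,' + base64.b64encode(f.read()).decode()

response = requests.post(
    'https://api-beta-vlm.jina.ai/v1/chat/completions',
    headers={'Authorization': f'Bearer {os.environ["JINA_API_KEY"]}'},
    json={
        'model': 'jina-vlm',
        'messages': [{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'What is in this image?'},
                {'type': 'image_url', 'image_url': {'url': to_data_uri('image.jpg')}},
            ],
        }],
    },
)
print(response.json()['choices'][0]['message']['content'])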
Text-only query
curl https://api-beta-vlm.jina.ai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $JINA_API_KEY" \
-d '{
"model": "jina-vlm",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}'
Streaming response
Add "stream": true to receive tokens as they're generated:
curl https://api-beta-vlm.jina.ai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $JINA_API_KEY" \
-d '{
"model": "jina-vlm",
"stream": true,
"messages": [{"role": "user", "content": "Write a haiku about coding"}]
}'
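With the openai client from the earlier sketch, streaming works the same way: pass stream=True and iterate over the returned chunks.
stream = client.chat.completions.create(
    model='jina-vlm',
    messages=[{'role': 'user', 'content': 'Write a haiku about coding'}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental token delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)
print()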
When the service is cold starting, you'll receive:
{
"error": {
"message": "Model is loading, please retry in 30-60 seconds. Cold start takes ~30s after the service scales up.",
"code": 503
}
}
Simply retry your request after waiting.
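For unattended use, a small retry loop on HTTP 503 is enough. The sketch below uses requests; the payload is any of the JSON bodies shown above, and the function name is ours.
import os
import time
import requests

def chat_with_retry(payload, retries=5, wait=30):
    # Retry while the model is loading (HTTP 503); return the first successful response
    for _ in range(retries):
        resp = requests.post(
            'https://api-beta-vlm.jina.ai/v1/chat/completions',
            headers={'Authorization': f'Bearer {os.environ["JINA_API_KEY"]}'},
            json=payload,
        )
        if resp.status_code != 503:
            resp.raise_for_status()
            return resp.json()
        time.sleep(wait)
    raise RuntimeError('Model did not become ready in time')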
Local Installation
Install the dependencies with uv:
uv sync
For CUDA users with FlashAttention2 support:
uv sync --extra flash-attn
Using the CLI
You can chat with jina-vlm directly using the infer.py CLI:
# Single image
python infer.py -i image.jpg -p "What's in this image?"
# Streaming output
python infer.py -i image.jpg -p "Describe this image" --stream
# Multiple images
python infer.py -i img1.jpg -i img2.jpg -p "Compare these images"
# Text-only
python infer.py -p "What is the capital of France?"
Options:
- `-m, --model`: Model path. Auto-detects a local repo (if `config.json` exists) or falls back to `jinaai/jina-vlm` from Hugging Face.
- `-i, --image`: Image path, URL, or glob pattern (can be specified multiple times).
- `-p, --prompt`: Text prompt (can be specified multiple times).
- `--max-crops`: Maximum number of crops (default: 12).
- `--max-tokens`: Maximum output tokens (default: 1024).
- `--max-pixels`: Maximum pixels per image; larger images are resized while preserving the aspect ratio.
- `--stream`: Enable streaming output.
Example:
python infer.py -i assets/the_persistence_of_memory.jpg -p "Describe this picture"
Using Transformers
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
processor = AutoProcessor.from_pretrained(
'jinaai/jina-vlm', use_fast=False, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
'jinaai/jina-vlm',
device_map='auto',
trust_remote_code=True
)
image = 'https://picsum.photos/800/600'
conversation = [
{
'role': 'user',
'content': [
{'type': 'image', 'image': image},
{'type': 'text', 'text': 'Describe this image'},
],
}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding='longest', return_tensors='pt')
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
output = model.generate(
**inputs,
generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
return_dict_in_generate=True,
use_model_defaults=True,
)
response = processor.tokenizer.decode(
output.sequences[0][inputs['input_ids'].shape[-1]:],
skip_special_tokens=True
)
print(response)
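To stream tokens to stdout instead of waiting for the full response, the same call can be made with transformers' TextStreamer (a sketch reusing the processor, model, and inputs from above):
from transformers import TextStreamer

streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs,
    generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
    streamer=streamer,  # prints decoded tokens as they are generated
    use_model_defaults=True,
)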
Multi-image inference
images = ['https://picsum.photos/id/1/800/600', 'https://picsum.photos/id/2/800/600']
conversation = [
{
'role': 'user',
'content': [
{'type': 'image', 'image': images[0]},
{'type': 'image', 'image': images[1]},
{'type': 'text', 'text': 'What is the difference between these images?'},
],
}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text], images=images, padding='longest', return_tensors='pt')
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
output = model.generate(
**inputs,
generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
return_dict_in_generate=True,
use_model_defaults=True,
)
response = processor.tokenizer.decode(
output.sequences[0][inputs['input_ids'].shape[-1]:],
skip_special_tokens=True
)
print(response)
Text-only inference
conversation = [
{
'role': 'user',
'content': [
{'type': 'text', 'text': 'Explain quantum computing in simple terms'},
],
}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text], padding='longest', return_tensors='pt')
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
output = model.generate(
**inputs,
generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
return_dict_in_generate=True,
use_model_defaults=True,
)
response = processor.tokenizer.decode(
output.sequences[0][inputs['input_ids'].shape[-1]:],
skip_special_tokens=True
)
print(response)
Batch inference
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
processor = AutoProcessor.from_pretrained(
'jinaai/jina-vlm', use_fast=False, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
'jinaai/jina-vlm',
device_map='auto',
torch_dtype=torch.bfloat16,
attn_implementation='flash_attention_2',
trust_remote_code=True
)
images = [
'https://picsum.photos/id/22/800/600',
'https://picsum.photos/id/49/800/600'
]
conversations = [
[
{
'role': 'user',
'content': [
{'type': 'image', 'image': images[0]},
{'type': 'text', 'text': 'What is the man doing in this image?'},
],
}
],
[
{
'role': 'user',
'content': [
{'type': 'image', 'image': images[1]},
{'type': 'text', 'text': 'What country\'s flag is in this image?'},
],
}
],
]
texts = processor.apply_chat_template(conversations, add_generation_prompt=True)
inputs = processor(text=texts, images=images, padding='longest', return_tensors='pt')
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
output = model.generate(
**inputs,
generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
return_dict_in_generate=True,
use_model_defaults=True,
)
for idx in range(len(output.sequences)):
gen_ids = output.sequences[idx][inputs['input_ids'].shape[-1]:]
response = processor.tokenizer.decode(gen_ids, skip_special_tokens=True)
print(f"Response {idx+1}: {response}")
Batch inference with mixed examples
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
processor = AutoProcessor.from_pretrained(
'jinaai/jina-vlm', use_fast=False, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
'jinaai/jina-vlm',
device_map='auto',
torch_dtype=torch.bfloat16,
attn_implementation='flash_attention_2',
trust_remote_code=True
)
images = [
['https://picsum.photos/id/22/800/600'],
['https://picsum.photos/id/49/800/600'],
['https://picsum.photos/id/0/800/600', 'https://picsum.photos/id/2/800/600'],
[],
]
conversations = [
[
{
'role': 'user',
'content': [
{'type': 'image', 'image': images[0][0]},
{'type': 'text', 'text': 'What is the man doing in this image?'},
],
}
],
[
{
'role': 'user',
'content': [
{'type': 'image', 'image': images[1][0]},
{'type': 'text', 'text': 'What country\'s flag is in this image?'},
],
}
],
[
{
'role': 'user',
'content': [
{'type': 'image', 'image': images[2][0]},
{'type': 'image', 'image': images[2][1]},
{'type': 'text', 'text': 'What is the difference between these two images?'},
],
}
],
[
{
'role': 'user',
'content': [
{'type': 'text', 'text': 'Describe the concept of polymorphism in Computer Science'},
],
}
],
]
texts = processor.apply_chat_template(conversations, add_generation_prompt=True)
inputs = processor(text=texts, images=images, padding='longest', return_tensors='pt')
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
output = model.generate(
**inputs,
generation_config=GenerationConfig(max_new_tokens=512, do_sample=False),
return_dict_in_generate=True,
use_model_defaults=True,
)
for idx in range(len(output.sequences)):
gen_ids = output.sequences[idx][inputs['input_ids'].shape[-1]:]
response = processor.tokenizer.decode(gen_ids, skip_special_tokens=True)
print(f"Response {idx+1}: {response}")
Evaluation
Multilingual Understanding
| Model | MMMB ar | MMMB cn | MMMB en | MMMB avg | Multilingual MMBench avg | Overall |
|---|---|---|---|---|---|---|
| jina-vlm | 76.9 | 80.0 | 82.0 | 78.8 | 74.3 | 59.6 |
| Qwen2-VL-2B | 68.3 | 74.2 | 78.3 | 71.3 | 69.4 | 53.8 |
| Qwen3-VL-2B | 72.7 | 75.7 | 80.7 | 75.0 | 72.3 | 58.2 |
| InternVL3-2B | 68.6 | 78.3 | 81.9 | 73.6 | 71.9 | 57.4 |
| InternVL3.5-2B | 68.5 | 77.7 | 80.2 | 74.6 | 70.9 | 58.0 |
General VQA Tasks
| Model | AI2D | ChartQA | TextVQA | DocVQA | InfoVQA | OCRBench | SEED-2+ | CharXiv | Avg |
|---|---|---|---|---|---|---|---|---|---|
| jina-vlm | 82.0 | 81.9 | 83.2 | 90.6 | 71.6 | 778 | 67.2 | 32.3/63.5 | 72.3 |
| Qwen2-VL-2B | 74.7 | 73.5 | 79.7 | 89.2 | 64.0 | 809 | 62.4 | 23.3/55.0 | 66.4 |
| Qwen3-VL-2B | 76.9 | 77.2 | 79.5 | 92.3 | 71.9 | 858 | 67.3 | 28.8/62.3 | 71.6 |
| InternVL3-2B | 78.6 | 80.2 | 77.0 | 87.4 | 67.1 | 835 | 64.6 | 28.3/54.7 | 69.2 |
| InternVL3.5-2B | 78.8 | 80.7 | 76.5 | 88.5 | 69.3 | 836 | 68.0 | 31.6/65.0 | 71.6 |
Text-Only Performance
| Model | MMLU | MMLU-Pro | GSM-8K | ARC-C | HellaSwag |
|---|---|---|---|---|---|
| jina-vlm | 56.1 | 30.3 | 71.3 | 77.3 | 59.4 |
| Qwen3-1.7B | 62.6 | 46.4 | 75.3 | 73.4 | 59.0 |
Citation
If you find jina-vlm useful in your research, please cite our technical report:
@misc{koukounas2025jinavlm,
title={Jina-VLM: Small Multilingual Vision Language Model},
author={Andreas Koukounas and Georgios Mastrapas and Florian HΓΆnicke and Sedigheh Eslami and Guillaume Roncari and Scott Martens and Han Xiao},
year={2025},
eprint={2512.04032},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.04032},
}
License
jina-vlm is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to contact us.