Metis-8B-RL

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Metis-8B-RL is the final RL-trained checkpoint of the Metis framework, trained with Hierarchical Decoupled Policy Optimization (HDPO) on top of Metis-8B-ColdStart. It is a strategic multimodal reasoning agent that selectively invokes code execution, text search, and image search tools during multi-turn reasoning.

[Paper (arXiv)] | [GitHub] | [ColdStart Model] | [RL Data] | [ColdStart Data]

Highlights

98% → 2% Tool Calls: Cuts the rate of blind tool invocation from 98% to 2%, roughly a 50-fold reduction.
SOTA Performance: Best accuracy across 13 benchmarks among open-source 8B agentic models.
Meta-Cognitive Wisdom: Learns when to use tools, not just how.

Model Details

Attribute	Value
Base model	Qwen3-VL-8B-Instruct
SFT checkpoint	Metis-8B-ColdStart
RL algorithm	HDPO (Hierarchical Decoupled Policy Optimization)
Training data	Metis-RL (~5K prompts)
License	Apache-2.0

HDPO Training Hyperparameters

Hyperparameter	Value
Batch size	128
Rollouts per prompt (G)	16
Learning rate	1e-6
KL coefficient	0
Loss weights	w_acc = 1.0, w_tool = 0.15
Max response length	16,384 tokens
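
For readers who want these values in code, the table can be mirrored as a small configuration object. This is a minimal sketch, not the repository's actual config schema: the class name `HDPOConfig`, its field names, and the derived `rollouts_per_step` property are hypothetical, and the derived value assumes that the batch size counts prompts rather than individual rollouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HDPOConfig:
    """Hypothetical container mirroring the hyperparameter table."""

    batch_size: int = 128             # assumed to count prompts per update
    rollouts_per_prompt: int = 16     # G in the table
    learning_rate: float = 1e-6
    kl_coef: float = 0.0              # KL coefficient of 0: no KL penalty term
    w_acc: float = 1.0                # weight on the accuracy loss
    w_tool: float = 0.15              # weight on the tool-efficiency loss
    max_response_length: int = 16_384

    @property
    def rollouts_per_step(self) -> int:
        # Only meaningful if batch_size counts prompts, not rollouts.
        return self.batch_size * self.rollouts_per_prompt


cfg = HDPOConfig()
print(cfg.rollouts_per_step)
```

Under the stated assumption, each update would draw 128 × 16 = 2,048 rollouts.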

Method: Hierarchical Decoupled Policy Optimization (HDPO)

Current agentic multimodal models suffer from blind tool invocation — they reflexively call external tools even when queries are directly resolvable from the visual context. Existing RL methods attempt to fix this by coupling accuracy and tool-efficiency into a single scalar reward, but this creates an irreconcilable optimization dilemma.

HDPO resolves this through three key components:

Dual Reward Design: An accuracy reward (r_acc) and a tool-efficiency reward (r_tool), where r_tool is conditioned on correctness.
Decoupled Advantage Estimation: Accuracy advantages are computed over all rollouts; tool-efficiency advantages are computed only over correct rollouts (conditional GRPO).
Hierarchical Policy Update: Two independent clipped surrogate losses are combined as L_HDPO = w_acc · L_GRPO(A_acc) + w_tool · L_GRPO(A_tool).

This naturally induces an implicit curriculum: first learn to be correct, then learn to be efficient.
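
The three components above can be sketched for a single prompt's group of rollouts as follows. This is an illustrative NumPy sketch under stated assumptions, not the released training code: the form of r_tool (here `1 / (1 + tool_calls)`), the `EPS` stabiliser, the clip range of 0.2, the sequence-level (rather than per-token) loss, and all function names are hypothetical choices made for the example.

```python
import numpy as np

EPS = 1e-6  # assumed stabiliser for the group standard deviation


def group_normalize(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: center and scale rewards within one group."""
    return (rewards - rewards.mean()) / (rewards.std() + EPS)


def hdpo_advantages(correct: np.ndarray, tool_calls: np.ndarray):
    """Return (A_acc, A_tool) for the G rollouts of one prompt.

    correct:    boolean array, True where the final answer is right.
    tool_calls: integer array, number of tool invocations per rollout.
    """
    r_acc = correct.astype(float)
    a_acc = group_normalize(r_acc)  # normalized over all rollouts

    # Assumed efficiency reward: fewer tool calls give a higher reward.
    r_tool = 1.0 / (1.0 + tool_calls)

    # Tool advantages are normalized over correct rollouts only;
    # incorrect rollouts receive zero tool advantage.
    a_tool = np.zeros_like(r_acc)
    if correct.sum() >= 2:
        a_tool[correct] = group_normalize(r_tool[correct])
    return a_acc, a_tool


def clipped_surrogate(ratio: np.ndarray, adv: np.ndarray, clip: float = 0.2) -> float:
    """Clipped surrogate loss, averaged over rollouts (clip range assumed)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    return float(-np.minimum(unclipped, clipped).mean())


def hdpo_loss(ratio, a_acc, a_tool, w_acc: float = 1.0, w_tool: float = 0.15) -> float:
    """L_HDPO = w_acc * L(A_acc) + w_tool * L(A_tool)."""
    return w_acc * clipped_surrogate(ratio, a_acc) + w_tool * clipped_surrogate(ratio, a_tool)


# Example group: three correct rollouts with 0, 3, and 1 tool calls, one incorrect.
correct = np.array([True, True, False, True])
tool_calls = np.array([0, 3, 5, 1])
a_acc, a_tool = hdpo_advantages(correct, tool_calls)
loss = hdpo_loss(np.ones(4), a_acc, a_tool)

# A group with no correct rollouts yields an all-zero tool advantage.
_, a_tool_none = hdpo_advantages(np.zeros(4, dtype=bool), tool_calls)
```

In this sketch the incorrect rollout gets zero tool advantage, and among the correct rollouts the one with no tool calls is ranked highest. When no rollout in a group is correct, the tool term vanishes and only the accuracy term can contribute a gradient, which is one way to read the implicit curriculum described above.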

Evaluation Results

Perception and Document Understanding

Model	V*Bench	HR4K	HR8K	TreeBench	MME-RW	SEED2+	CharXiv(DQ)	CharXiv(RQ)
Qwen3-VL-8B-Instruct	86.4	78.9	74.6	40.7	61.9	71.0	83.0	46.3
DeepEyesV2	81.8	77.9	73.8	42.5	64.9	70.5	78.6	48.9
SenseNova-MARS-8B	92.2	83.1	78.4	-	67.9	-	-	-
Skywork-R1V4-30B-A3B	88.0	82.8	79.8	-	71.4	-	-	-
Metis (Ours)	91.1	83.5	82.0	45.2	70.3	72.5	83.4	54.1

Mathematical and Logical Reasoning

Model	MathVista	MathVerse	WeMath	DynaMath	LogicVista	Avg.
Qwen3-VL-8B-Instruct	76.3	61.3	38.8	65.5	54.9	59.4
DeepEyesV2	71.9	52.7	38.1	57.2	48.7	53.7
Metis (Ours)	78.0	65.9	65.2	69.2	56.2	66.9
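
The Avg. column appears to be the unweighted mean of the five benchmark scores in each row. The short check below recomputes it from the table values; the dictionary layout and variable names are illustrative only.

```python
# Scores copied from the table above, in column order:
# MathVista, MathVerse, WeMath, DynaMath, LogicVista.
scores = {
    "Qwen3-VL-8B-Instruct": [76.3, 61.3, 38.8, 65.5, 54.9],
    "DeepEyesV2": [71.9, 52.7, 38.1, 57.2, 48.7],
    "Metis (Ours)": [78.0, 65.9, 65.2, 69.2, 56.2],
}

# Unweighted mean per model, rounded to one decimal place like the table.
averages = {model: round(sum(vals) / len(vals), 1) for model, vals in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg}")
```

The recomputed means match the reported 59.4, 53.7, and 66.9.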

Usage

Please refer to the GitHub repository for full installation and inference instructions.

Installation

git clone https://github.com/Accio-Lab/Metis.git
cd Metis
pip install -e verl
pip install -e ".[vllm,search_tool,python_code_dep]"

Citation

@article{yan2026metis,
  title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
  author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
  journal={arXiv preprint arXiv:2604.08545},
  year={2026}
}