Papers
arxiv:2604.02870

Token Warping Helps MLLMs Look from Nearby Viewpoints

Published on Apr 3
· Submitted by Yuseung "Phillip" Lee on Apr 6
#3 Paper of the day

Abstract

Token-level warping in vision-language models demonstrates superior stability and semantic coherence for viewpoint transformation compared to pixel-wise methods, achieving better visual reasoning performance.

AI-generated summary

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
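The backward token warping the summary describes (define a dense grid on the target view, then retrieve the corresponding source-view token for each grid point) can be sketched as a nearest-neighbour gather over a ViT token grid. This is a minimal illustrative sketch, not the paper's implementation: the function name `warp_tokens_backward`, the assumption of a pinhole camera with shared intrinsics `K`, one depth value per token cell, and a rigid transform `T_tgt_to_src` are all assumptions for the example.

```python
# Hedged sketch of backward token warping: for each cell of the target-view
# token grid, find where it lands in the source view and copy that token.
# All names and the camera model are illustrative assumptions, not the paper's code.
import numpy as np

def warp_tokens_backward(tokens, depth_tgt, K, T_tgt_to_src):
    """Gather source-view tokens onto a target-view grid.

    tokens:       (H, W, C) source-view token grid
    depth_tgt:    (H, W) depth at each target grid point
    K:            (3, 3) pinhole intrinsics, assumed shared by both views
    T_tgt_to_src: (4, 4) rigid transform from target to source camera frame
    """
    H, W, C = tokens.shape
    # Dense grid of target cell centres in homogeneous pixel coordinates.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Unproject each target grid point to 3-D using its depth.
    rays = pix @ np.linalg.inv(K).T               # (H*W, 3)
    pts_tgt = rays * depth_tgt.reshape(-1, 1)     # (H*W, 3)
    # Move the 3-D points into the source camera frame.
    pts_h = np.concatenate([pts_tgt, np.ones((H * W, 1))], axis=1)
    pts_src = (pts_h @ T_tgt_to_src.T)[:, :3]
    # Reproject into the source token grid.
    proj = pts_src @ K.T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)
    # Nearest source cell, clamped to the grid (no distortion-prone splatting,
    # which is the failure mode of forward warping the summary contrasts with).
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    return tokens[v, u].reshape(H, W, C)
```

Because every target cell retrieves exactly one source token, the output grid is always dense; forward warping, by contrast, scatters source cells into the target view and can leave holes or collisions, which is the instability the summary attributes to pixel-wise approaches.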


Get this paper in your agent:

hf papers read 2604.02870
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0


Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 1