Papers
arxiv:2601.21037

Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning

Published on Jan 28 · Submitted by Chengzu Li on Feb 6

Abstract

AI-generated summary: Video generation models demonstrate robust zero-shot generalization for visual reasoning tasks through explicit visual context utilization and test-time scaling capabilities.

Vision-Language Models have excelled at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for complex visual reasoning. In this work, we formulate visual reasoning by means of video generation models, positing that generated frames can act as intermediate reasoning steps between initial states and solutions. We evaluate their capacity in two distinct regimes: Maze Navigation for sequential discrete planning with low visual change and Tangram Puzzle for continuous manipulation with high visual change. Our experiments reveal three critical insights: (1) Robust Zero-Shot Generalization: In both tasks, the model demonstrates strong performance on unseen data distributions without specific finetuning. (2) Visual Context: The model effectively uses visual context as explicit control, such as agent icons and tangram shapes, enabling it to maintain high visual consistency and adapt its planning capability robustly to unseen patterns. (3) Visual Test-Time Scaling: We observe a test-time scaling law in sequential planning; increasing the generated video length (visual inference budget) empowers better zero-shot generalization to spatially and temporally complex paths. These findings suggest that video generation is not merely a media tool, but a scalable, generalizable paradigm for visual reasoning.
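
To make the Maze Navigation framing concrete, the following is a minimal sketch of how a sequence of per-frame agent positions could be checked as a valid solution path. It is not the authors' evaluation code. The grid encoding (`#` for walls), the function name `is_valid_path`, and the assumption that agent positions have already been extracted from the generated frames are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col); hypothetical encoding of the agent position in one frame


def is_valid_path(grid: List[str], path: List[Cell], start: Cell, goal: Cell) -> bool:
    """Check that per-frame agent positions form a legal maze solution.

    Assumptions (not from the paper): '#' marks a wall, the agent may stay
    in place between consecutive frames, and it may move at most one cell
    up, down, left, or right per frame.
    """
    # The path must begin at the initial state and end at the solution state.
    if not path or path[0] != start or path[-1] != goal:
        return False

    rows, cols = len(grid), len(grid[0])

    # Every frame must place the agent inside the maze and off the walls.
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
            return False

    # Consecutive frames may differ by at most one grid step (no jumps, no diagonals).
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        if abs(r1 - r0) + abs(c1 - c0) > 1:
            return False

    return True


# Toy 3x3 maze; the agent icon starts top-left and must reach bottom-right.
grid = [
    "..#",
    ".##",
    "...",
]
good = [(0, 0), (1, 0), (1, 0), (2, 0), (2, 1), (2, 2)]  # includes one idle frame
bad = [(0, 0), (0, 1), (1, 1), (2, 2)]                    # steps into a wall

print(is_valid_path(grid, good, (0, 0), (2, 2)))
print(is_valid_path(grid, bad, (0, 0), (2, 2)))
```

In this toy setup, a longer generated video simply corresponds to a longer `path` list, which is one way to picture why a larger visual inference budget can accommodate longer routes.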

Community

Paper submitter

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/thinking-in-frames-how-visual-context-and-test-time-scaling-empower-video-reasoning-5159-f056e860

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications
