V-Thinker: Interactive thinking with images

By Runqi Qiao · Qiuna Tan · Minghan Yang · Guanting Dong · Peiqing Yang · Shiqiang Lang +4 more

Original abstract

Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Characterizing Datasets for Social Visual Question Answering, and the New TinySocial Dataset
Shiyao Li, Roxanne Rashedi, Xiaoman Zi, Morgan Elrod-Erickson, Bryan Hollis, Angela Maliakal, Xinyu Shen, Simeng Zhao, Maithilee Kunda, Zhanwen Chen
2020
≈ 67%
Visual Theory of Mind Enables the Invention of Proto-Writing
Lucas Gelfond, George Konidaris, Benjamin A. Spiegel
2025
≈ 67%
Pathdreamer: A World Model for Indoor Navigation
Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson, Jing Yu Koh
2021
≈ 67%
PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding
Zhenfeng Gan, Yanbo Chen, Junbo Tan, Xueqian Wang, Lirong Che
2026
≈ 67%
Embodied AI Agents: Modeling the World
Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Hervé Jégou, Alessandro Lazaric, Arjun Majumdar, Andrea Madotto, Franziska Meier, Florian Metze, Louis-Philippe Morency, Théo Moutakanni, Juan Pino, Basile Terver, Joseph Tighe, Paden Tomasello, Jitendra Malik, Pascale Fung
2025
≈ 66%
Zero-Shot Textual Explanations via Translating Decision-Critical Features
Hiroshi Kera, Kazuhiko Kawamoto, Toshinori Yamauchi
2025
≈ 66%
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
Xuanang Gao, Jiaxi Cao, Gengming Zhang, Shengnan Ma, Wenwen Tong, Hanming Deng, Jie Yang, Wei Liu, Zhiwei Ning
2026
≈ 66%
VISTAv2: World Imagination for Indoor Vision-and-Language Navigation
Xianshun Jiang, Xiangbo Gao, Mingyang Wu, Zhengzhong Tu, Yanjia Huang
2025
≈ 66%
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, Ning Guo, Shuang Zeng
2025
≈ 66%
Do multimodal models imagine electric sheep?
Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun, Santhosh Kumar Ramakrishnan
2026
≈ 66%
Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog
Yu-Jung Heo, Byoung-Tak Zhang, Sang-Woo Lee
2018
≈ 66%
How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark
Mallika Mainali, Anik Sen, Ximing Wen
2025
≈ 66%
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
Qing Lian, Jinghang Li, Qing Jiang, Tianming Zhang, Xiaoke Jiang, Chuanxiu Liu, Jie Liu, Lei Zhang, Yiran Ling
2026
≈ 65%
HiVAE: Hierarchical Latent Variables for Scalable Theory of Mind
Rahath Malladi, Arshia Sangwan, David Danks, Tauhidur Rahman, Nigel Doering
2026
≈ 65%
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Lazar Valkov, Vitali Petsiuk, Hossein Souri, Deen Dayal Mohan, Juhong Min
2026
≈ 65%
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds
in corpus
2022
≈ 62%
Interpreting Language Model Parameters
in corpus
2026
≈ 61%
Emergent Introspective Awareness in Large Language Models
in corpus
2026
≈ 61%
AI: a Bridge toward Diverse Intelligence and Humanity’s Future
in corpus
2024
≈ 60%
Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
in corpus
2023
≈ 60%
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
in corpus
≈ 60%
The computational boundary of a 'self': developmental bioelectricity drives multicellularity and scale-free cognition
in corpus
2019
≈ 60%
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
in corpus
2024
≈ 59%
Multimodal Chain-of-Thought Reasoning in Language Models
in corpus
2023
≈ 59%
AI as a Buddhist Self-Overcoming Technique in Another Medium
in corpus
2025
≈ 59%
Sharing the World with Digital Minds
in corpus
≈ 59%
Culture and the Arts: From Art Worlds to Arts-in-Action
in corpus
2020
≈ 59%
Collective intelligence: A unifying concept for integrating biology across scales and substrates
in corpus
2024
≈ 59%

Cited by (1)

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
ATLAS resolves a core trade-off in visual reasoning by introducing functional tokens — single discrete 'words' that simultaneously serve as agentic operations and latent visual reasoning units, elimin