paper:arxiv-2511-04460
V-Thinker: Interactive thinking with images
Original abstract
Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Characterizing Datasets for Social Visual Question Answering, and the New TinySocial Dataset. Shiyao Li, Roxanne Rashedi, Xiaoman Zi, Morgan Elrod-Erickson, Bryan Hollis, Angela Maliakal, Xinyu Shen, Simeng Zhao, Maithilee Kunda, Zhanwen Chen. 2020. ≈ 67%
- Visual Theory of Mind Enables the Invention of Proto-Writing. Lucas Gelfond, George Konidaris, Benjamin A. Spiegel. 2025. ≈ 67%
- Pathdreamer: A World Model for Indoor Navigation. Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson, Jing Yu Koh. 2021. ≈ 67%
- PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding. Zhenfeng Gan, Yanbo Chen, Junbo Tan, Xueqian Wang, Lirong Che. 2026. ≈ 67%
- Embodied AI Agents: Modeling the World. Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Hervé Jégou, Alessandro Lazaric, Arjun Majumdar, Andrea Madotto, Franziska Meier, Florian Metze, Louis-Philippe Morency, Théo Moutakanni, Juan Pino, Basile Terver, Joseph Tighe, Paden Tomasello, Jitendra Malik, Pascale Fung. 2025. ≈ 66%
- Zero-Shot Textual Explanations via Translating Decision-Critical Features. Hiroshi Kera, Kazuhiko Kawamoto, Toshinori Yamauchi. 2025. ≈ 66%
- V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning. Xuanang Gao, Jiaxi Cao, Gengming Zhang, Shengnan Ma, Wenwen Tong, Hanming Deng, Jie Yang, Wei Liu, Zhiwei Ning. 2026. ≈ 66%
- VISTAv2: World Imagination for Indoor Vision-and-Language Navigation. Xianshun Jiang, Xiangbo Gao, Mingyang Wu, Zhengzhong Tu, Yanjia Huang. 2025. ≈ 66%
- FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving. Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, Ning Guo, Shuang Zeng. 2025. ≈ 66%
- Do multimodal models imagine electric sheep? Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun, Santhosh Kumar Ramakrishnan. 2026. ≈ 66%
- Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog. Yu-Jung Heo, Byoung-Tak Zhang, Sang-Woo Lee. 2018. ≈ 66%
- How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark. Mallika Mainali, Anik Sen, Ximing Wen. 2025. ≈ 66%
- Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models. Qing Lian, Jinghang Li, Qing Jiang, Tianming Zhang, Xiaoke Jiang, Chuanxiu Liu, Jie Liu, Lei Zhang, Yiran Ling. 2026. ≈ 65%
- HiVAE: Hierarchical Latent Variables for Scalable Theory of Mind. Rahath Malladi, Arshia Sangwan, David Danks, Tauhidur Rahman, Nigel Doering. 2026. ≈ 65%
- Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models. Lazar Valkov, Vitali Petsiuk, Hossein Souri, Deen Dayal Mohan, Juhong Min. 2026. ≈ 65%
- Interpreting Language Model Parameters. In corpus. 2026. ≈ 61%
- The computational boundary of a 'self': developmental bioelectricity drives multicellularity and scale-free cognition. In corpus. 2019. ≈ 60%
- Sharing the World with Digital Minds. In corpus. ≈ 59%
- Collective intelligence: A unifying concept for integrating biology across scales and substrates. In corpus. 2024. ≈ 59%
Similar preprints — Semantic Scholar
Cited by (1)
- ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
ATLAS resolves a core trade-off in visual reasoning by introducing functional tokens — single discrete 'words' that simultaneously serve as agentic operations and latent visual reasoning units, elimin