paper:guo-atlas-2026
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
TL;DR
ATLAS resolves a core trade-off in visual reasoning by introducing functional tokens — single discrete 'words' that simultaneously serve as agentic operations and latent visual reasoning units, eliminating the need to choose between the two paradigms. Existing approaches split into two camps: agentic methods (code or tool calls) that suffer context-switching latency from external execution, and latent methods (learnable hidden embeddings) that lack cross-task generalization and resist autoregressive parallelization during training. ATLAS, proposed by Guo et al. (2026, arXiv:2605.15198), encodes each functional token with an internalized visual operation yet requires no visual supervision and remains a standard token within the tokenizer vocabulary, making it architecturally compatible with existing autoregressive language model pipelines. The framework couples the interpretability and controllability of agentic tool use with the speed and end-to-end trainability of latent reasoning, without requiring a separate image generation module or unified generative model. The paper argues this implies that discrete symbolic vocabulary tokens are a sufficient and computationally efficient substrate for visual reasoning, and that the agentic/latent dichotomy is a false choice that can be collapsed into a single token-level mechanism applicable across heterogeneous visual tasks.
What to take away
- 1. ATLAS introduces 'functional tokens' — single discrete vocabulary-level tokens that simultaneously encode an agentic visual operation and act as a latent visual reasoning unit, requiring no visual supervision during training.
- 2. Agentic visual reasoning methods (e.g., code- or tool-call-based approaches) incur context-switching latency from external execution engines, which ATLAS eliminates by internalizing the operation within the token itself.
- 3. Latent visual reasoning methods using learnable hidden embeddings suffer from two compounding limitations: poor cross-task generalization and incompatibility with autoregressive parallelization during training.
- 4. Direct image generation through unified generative models during reasoning is identified as computationally expensive and architecturally non-trivial, motivating the lightweight functional-token approach.
- 5. Each functional token in ATLAS is associated with an internalized visual operation but remains a standard token in the tokenizer vocabulary, generated by ordinary next-token prediction (arXiv:2605.15198), so no architectural changes or special decoding logic are required.
- 6. ATLAS is framed as combining the strengths of both agentic and latent paradigms while mitigating their respective limitations — controllability and interpretability from agentic methods, speed and trainability from latent methods.
- 7. An open question the paper raises is whether a single discrete token is expressive enough to capture the full diversity of visual reasoning operations across tasks of varying complexity, or whether functional token sets need to be scaled.
- 8. As a replicable methodology choice, functional tokens are introduced into the standard tokenizer vocabulary and trained without any visual supervision signal. The training pipeline stays compatible with vanilla SFT and RL over the augmented vocabulary, with Latent-Anchored GRPO (LA-GRPO) added to counter the sparsity of functional tokens during RL.
- 9. The paper's central hypothesis is that the agentic/latent dichotomy in visual reasoning is a false dichotomy, and that a unified discrete-token mechanism is sufficient to subsume both paradigms without architectural compromise.
- 10. The ingestion pipeline's suggested_action metadata (score: 0.195, action: 'read') rated the paper low-priority in automated triage, possibly because the abstract contains no benchmark numbers. The full experimental results in guo-atlas-2026-fulltext.md are therefore worth checking directly.
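The contrast between takeaways 2 and 5 can be made concrete with a toy decoding loop. This is a minimal sketch and not the paper's implementation: the scripted token streams, the `run_external_tool` stub, the `CALL:` prefix, and the `<Line>` spelling are all hypothetical. It only counts how often decoding has to pause for an external executor.

```python
# Toy comparison of an agentic decode loop and a functional-token decode loop.
# Everything here is hypothetical: a scripted token list replaces real next-token prediction.

def run_external_tool(call: str) -> str:
    """Stand-in for an external executor such as a code sandbox or image tool."""
    return f"[result of {call}]"

def agentic_decode(script):
    """Consume tokens; on a tool call, pause, run the tool, append its output."""
    context, switches = [], 0
    for token in script:
        context.append(token)
        if token.startswith("CALL:"):
            # Decoding stops here and the tool output is spliced into the context.
            context.append(run_external_tool(token[len("CALL:"):]))
            switches += 1
    return context, switches

def functional_decode(script):
    """Functional tokens are ordinary vocabulary items, so decoding never pauses."""
    return list(script), 0

agentic_script = ["The", "lines", "CALL:draw_line", "intersect", "."]
functional_script = ["The", "lines", "<Line>", "intersect", "."]

agentic_ctx, agentic_switches = agentic_decode(agentic_script)
functional_ctx, functional_switches = functional_decode(functional_script)
print("agentic:", agentic_switches, "switches,", len(agentic_ctx), "context items")
print("functional:", functional_switches, "switches,", len(functional_ctx), "context items")
```

In this toy setup the agentic path performs one context switch and grows the context with tool output, while the functional-token path emits the same number of items it decodes.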
Peer brief — for seminar discussion
Guo et al. (arXiv:2605.15198, ingested 2026-05-15) tackle a structural tension in visual reasoning research: existing methods bifurcate into agentic approaches (tool calls, code execution) that pay context-switching latency costs, and latent approaches (learnable hidden embeddings) that fail to generalize across tasks and resist autoregressive parallelization. A third option — generating images directly through unified generative models during the reasoning chain — is dismissed as computationally expensive and architecturally non-trivial. The framework introduced is ATLAS, which collapses this dichotomy using functional tokens: single discrete words that live in the standard tokenizer vocabulary, each associated with an internalized visual operation, trained without any visual supervision signal. Because the functional token is a standard vocabulary item, it requires no modification to the decoding pipeline and is compatible with autoregressive parallelization, addressing the two core failure modes of latent methods simultaneously. The load-bearing claim is that one word is sufficient to act as both an agentic operator and a latent visual reasoning unit — implying that the agentic/latent distinction is architecturally unnecessary and that discrete symbolic tokens are an adequate substrate for intermediate visual states. An alternative the paper could have used is chain-of-thought with explicit tool-call syntax (as in models like GPT-4V or Gemini with code execution), which would preserve interpretability but not address the latency or parallelization problems. 
A critical reader would immediately push back on the generalization claim. The abstract asserts that latent methods 'lack task generalization,' but whether functional tokens actually generalize better across heterogeneous visual benchmarks depends entirely on experimental results that the abstract does not report. Without those numbers (model names, dataset names, accuracy deltas), the architectural argument is plausible but unverified. The paper's core prediction is that functional tokens will match or exceed both pure agentic and pure latent baselines on visual reasoning tasks while incurring lower inference latency than agentic methods and higher cross-task transfer than latent embedding methods.
Frameworks (1)
- ATLAS Framework
A framework where a single discrete word (functional token) serves both agentic operation and latent visual reasoning, requiring no visual supervision.
Findings (2)
- ATLAS with LA-GRPO achieves a 51.3% average on BLINK, up from a 22.8% baseline
Discrete functional tokens substantially improve structured visual reasoning on the BLINK benchmark, a core validation of ATLAS's effectiveness.
- Gradient Dilution Issue
During RL training on ATLAS, sparse functional tokens (2.3% of sequences) receive diluted gradient signals from sequence-level advantages propagated across all tokens.
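The dilution effect can be illustrated with simple arithmetic. The sketch below is not the LA-GRPO objective, whose exact form is not given on this page. It assumes a 1000-token sequence in which 2.3% of positions are functional tokens (the figure is borrowed only for illustration), a uniform sequence-level advantage applied to every token, and a hypothetical static `aux_weight` added at functional positions.

```python
# Toy illustration of gradient dilution under a sequence-level advantage.
# Assumptions (not from the paper): sequence length, token positions, aux_weight.

def functional_share(is_functional, advantage, aux_weight=0.0):
    """Fraction of total per-token loss weight that lands on functional tokens.

    Every token is weighted by the same sequence-level advantage;
    functional tokens additionally receive a static auxiliary weight.
    """
    weights = [abs(advantage) + (aux_weight if f else 0.0) for f in is_functional]
    functional_total = sum(w for w, f in zip(weights, is_functional) if f)
    return functional_total / sum(weights)

seq_len = 1000
num_functional = 23  # 2.3% of positions, used here only for illustration
is_functional = [i < num_functional for i in range(seq_len)]

baseline = functional_share(is_functional, advantage=1.0)
anchored = functional_share(is_functional, advantage=1.0, aux_weight=9.0)
print(f"baseline share: {baseline:.3f}")
print(f"anchored share: {anchored:.3f}")
```

With a uniform advantage the functional tokens receive only their positional share of the update signal. Any extra static weight at those positions raises that share, which is the intuition behind anchoring them with an auxiliary objective.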
Claims (2)
- Keeping functional-token vocabulary compact minimizes perturbation to base model token distribution
ATLAS design philosophy: five functional tokens suffice to abstract common visual operations without excessive disruption.
- Token-level supervision enables models to learn functional-token invocation from reasoning context
ATLAS authors' assertion that functional tokens optimized via standard cross-entropy loss learn when and how to invoke operations from surrounding text.
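As a rough illustration of what token-level supervision means here, the sketch below computes an ordinary per-token cross-entropy in which a functional token is just another target. The five-entry vocabulary, the `<Line>` spelling, and the hand-written logits are all hypothetical, and nothing below reproduces the paper's training code.

```python
import math

# Hypothetical five-entry vocabulary; "<Line>" stands in for a functional token.
vocab = ["the", "lines", "<Line>", "intersect", "."]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def cross_entropy(logits, target_id):
    """Negative log-probability of the target under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_norm - logits[target_id]

def sequence_loss(per_step_logits, targets):
    """Mean token-level loss; functional tokens get no special treatment."""
    losses = [cross_entropy(l, token_to_id[t]) for l, t in zip(per_step_logits, targets)]
    return sum(losses) / len(losses), losses

targets = ["lines", "<Line>", "intersect"]
per_step_logits = [
    [0.0, 2.0, 0.0, 0.0, 0.0],  # fairly confident about "lines"
    [0.0, 0.0, 0.0, 0.0, 0.0],  # uniform: no preference yet for "<Line>"
    [0.0, 0.0, 0.0, 3.0, 0.0],  # confident about "intersect"
]
mean_loss, losses = sequence_loss(per_step_logits, targets)
print(f"mean loss: {mean_loss:.4f}")
print("per-token losses:", [round(x, 4) for x in losses])
```

The position that targets `<Line>` carries the largest loss in this toy example, so minimizing the loss pushes probability toward the functional token in exactly the contexts where it appears in the training data.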
Hypotheses (1)
- Five functional tokens can generalize across 40+ diverse visual reasoning tasks
ATLAS hypothesis that a compact set of high-level functional tokens (Manip, Shape, Line, Arrow, Text) suffices for multi-domain visual reasoning.
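To show what a compact functional-token vocabulary looks like mechanically, here is a minimal sketch that appends five entries to a toy vocabulary so they receive ordinary token ids. The five names come from the entry above. The `<...>` spelling, the base vocabulary, and the whitespace tokenizer are assumptions made for illustration.

```python
# Sketch: append functional tokens to a toy vocabulary so they get ordinary ids.
# The "<...>" spelling, base vocabulary, and tokenizer are illustrative assumptions.

FUNCTIONAL_NAMES = ["Manip", "Shape", "Line", "Arrow", "Text"]

def augment_vocab(base_vocab, names):
    """Return a new token->id map with one functional token appended per name."""
    vocab = dict(base_vocab)
    for name in names:
        token = f"<{name}>"
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    """Whitespace tokenizer; unknown words map to the <unk> id."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

base_vocab = {"<unk>": 0, "the": 1, "two": 2, "segments": 3, "cross": 4}
vocab = augment_vocab(base_vocab, FUNCTIONAL_NAMES)
ids = encode("the two segments <Line> cross", vocab)
print(len(vocab), ids)
```

The existing ids are untouched and only five new ids are appended, which matches the stated goal of keeping perturbation to the base token distribution small.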
Questions (1)
- How can visual reasoning be preserved within discrete autoregressive sequences without external tools or pixel-level supervision?
Core research question addressed by ATLAS: bridging interpretability of agentic methods, efficiency of discrete tokens, and scalability of autoregressive training.
Original abstract
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.