ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

By Ziyu Guo · Rain Liu · Xinyan Chen · Pheng-Ann Heng
Meta AI, The Chinese University of Hong Kong

DOI 10.48550/arxiv.2605.15198 arXiv 2605.15198 OpenAlex W7161281398

Keywords: agentic methods · ATLAS Framework · agentic reasoning · autoregressive parallelization · code or tool calls · context-switching latency · external execution · Functional Token · intermediate visual states · internalized visual operation · latent methods · latent reasoning · learnable hidden embeddings · One word is enough for both agentic operation and latent reasoning unit

TL;DR

ATLAS resolves a core trade-off in visual reasoning by introducing functional tokens — single discrete 'words' that simultaneously serve as agentic operations and latent visual reasoning units, eliminating the need to choose between the two paradigms. Existing approaches split into two camps: agentic methods (code or tool calls) that suffer context-switching latency from external execution, and latent methods (learnable hidden embeddings) that lack cross-task generalization and resist autoregressive parallelization during training. ATLAS, proposed by Guo et al. (2026, arXiv:2605.15198), encodes each functional token with an internalized visual operation yet requires no visual supervision and remains a standard token within the tokenizer vocabulary, making it architecturally compatible with existing autoregressive language model pipelines. The framework couples the interpretability and controllability of agentic tool use with the speed and end-to-end trainability of latent reasoning, without requiring a separate image generation module or unified generative model. The paper argues this implies that discrete symbolic vocabulary tokens are a sufficient and computationally efficient substrate for visual reasoning, and that the agentic/latent dichotomy is a false choice that can be collapsed into a single token-level mechanism applicable across heterogeneous visual tasks.

What to take away

1. ATLAS introduces 'functional tokens' — single discrete vocabulary-level tokens that simultaneously encode an agentic visual operation and act as a latent visual reasoning unit, requiring no visual supervision during training.
2. Agentic visual reasoning methods (e.g., code- or tool-call-based approaches) incur context-switching latency from external execution engines, which ATLAS eliminates by internalizing the operation within the token itself.
3. Latent visual reasoning methods using learnable hidden embeddings suffer from two compounding limitations: poor cross-task generalization and incompatibility with autoregressive parallelization during training.
4. Direct image generation through unified generative models during reasoning is identified as computationally expensive and architecturally non-trivial, motivating the lightweight functional-token approach.
5. Each functional token in ATLAS is associated with an internalized visual operation but remains a standard token in the tokenizer vocabulary (arXiv:2605.15198), meaning no modifications to the tokenizer architecture or special decoding logic are required.
6. ATLAS is framed as combining the strengths of both agentic and latent paradigms while mitigating their respective limitations — controllability and interpretability from agentic methods, speed and trainability from latent methods.
7. An open question the paper raises is whether a single discrete token is expressive enough to capture the full diversity of visual reasoning operations across tasks of varying complexity, or whether functional token sets need to be scaled.
8. As a replicable methodology choice, functional tokens are introduced into the standard tokenizer vocabulary and trained without any visual supervision signal, meaning the training pipeline requires only standard language modeling objectives applied to an augmented vocabulary.
9. The paper's central hypothesis is that the agentic/latent dichotomy in visual reasoning is a false dichotomy, and that a unified discrete-token mechanism is sufficient to subsume both paradigms without architectural compromise.

Peer brief — for seminar discussion

Guo et al. (arXiv:2605.15198, ingested 2026-05-15) tackle a structural tension in visual reasoning research: existing methods bifurcate into agentic approaches (tool calls, code execution) that pay context-switching latency costs, and latent approaches (learnable hidden embeddings) that fail to generalize across tasks and resist autoregressive parallelization. A third option — generating images directly through unified generative models during the reasoning chain — is dismissed as computationally expensive and architecturally non-trivial. The framework introduced is ATLAS, which collapses this dichotomy using functional tokens: single discrete words that live in the standard tokenizer vocabulary, each associated with an internalized visual operation, trained without any visual supervision signal. Because the functional token is a standard vocabulary item, it requires no modification to the decoding pipeline and is compatible with autoregressive parallelization, addressing the two core failure modes of latent methods simultaneously. The load-bearing claim is that one word is sufficient to act as both an agentic operator and a latent visual reasoning unit — implying that the agentic/latent distinction is architecturally unnecessary and that discrete symbolic tokens are an adequate substrate for intermediate visual states. An alternative the paper could have used is chain-of-thought with explicit tool-call syntax (as in models like GPT-4V or Gemini with code execution), which would preserve interpretability but not address the latency or parallelization problems. 
A critical reader would immediately push back on the generalization claim. The abstract asserts that latent methods 'lack task generalization,' but whether functional tokens actually generalize better across heterogeneous visual benchmarks depends on experimental results that the abstract does not report. Without those numbers (model names, dataset names, accuracy deltas), the architectural argument is plausible but cannot be verified from the abstract alone. The paper's core prediction is that functional tokens will match or exceed both pure agentic and pure latent baselines on visual reasoning tasks, with lower inference latency than agentic methods and better cross-task transfer than latent embedding methods.
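The latency argument can be made concrete with a toy decoding loop. This is a minimal sketch, not the authors' implementation: the token spellings, the `run_external_tool` stub, and the scripted token streams are all hypothetical. It only counts how often decoding has to pause and hand control to an external executor.

```python
from typing import List, Tuple

# Hypothetical surface forms; the source does not say how tokens are spelled.
TOOL_CALL = "<tool_call>"
FUNCTIONAL = "<Line>"


def run_external_tool(context: List[str]) -> List[str]:
    """Stand-in for an external executor (code sandbox, image tool)."""
    return ["<tool_result>"]


def decode(script: List[str]) -> Tuple[List[str], int]:
    """Replay a scripted token stream and count context switches.

    A tool-call token forces decoding to stop, run the executor, and append
    its result to the context. A functional token is treated like any other
    vocabulary item, so decoding simply continues.
    """
    context: List[str] = []
    switches = 0
    for token in script:
        context.append(token)
        if token == TOOL_CALL:
            switches += 1
            context.extend(run_external_tool(context))
    return context, switches


agentic_ctx, agentic_switches = decode(["draw", TOOL_CALL, "so", "answer", "B"])
atlas_ctx, atlas_switches = decode(["draw", FUNCTIONAL, "so", "answer", "B"])
print(agentic_switches, atlas_switches)
```

In this toy setting the agentic stream pays one round trip and grows the context with the tool result, while the functional-token stream stays a plain token sequence. Real latency depends on the executor and serving stack, which the sketch does not model.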

Frameworks (1)

ATLAS Framework
A framework where a single discrete word (functional token) serves both agentic operation and latent visual reasoning, requiring no visual supervision.

Findings (2)

ATLAS with LA-GRPO achieves a 51.3% BLINK average, up from a 22.8% baseline
Discrete functional tokens substantially improve structured visual reasoning on BLINK benchmark, a core validation of ATLAS effectiveness.
Gradient Dilution Issue
During RL training on ATLAS, sparse functional tokens (2.3% of sequences) receive diluted gradient signals from sequence-level advantages propagated across all tokens.
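The dilution effect can be illustrated numerically. The sketch below is an assumption-laden toy, not the LA-GRPO objective, whose exact form is not given in this summary. It reads the 2.3% figure as a share of token positions, gives every token the same weight from a sequence-level advantage, and adds a hypothetical static `aux_weight` at functional-token positions only.

```python
from typing import List


def functional_share(is_functional: List[bool], aux_weight: float = 0.0) -> float:
    """Fraction of total per-token loss weight that lands on functional tokens.

    Every token receives the same weight from a sequence-level advantage
    (normalized to 1.0 here). `aux_weight` is a hypothetical static extra
    weight added only at functional-token positions.
    """
    weights = [1.0 + (aux_weight if f else 0.0) for f in is_functional]
    functional = sum(w for w, f in zip(weights, is_functional) if f)
    return functional / sum(weights)


# 1000 token positions, 23 of them functional (2.3%).
mask = [i < 23 for i in range(1000)]
plain = functional_share(mask)
anchored = functional_share(mask, aux_weight=9.0)
print(round(plain, 3), round(anchored, 3))
```

With uniform weighting the functional tokens receive only their 2.3% share of the signal. A static extra weight raises that share without changing the reward itself, which is the intuition behind anchoring sparse tokens with an auxiliary term.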

Claims (2)

Keeping functional-token vocabulary compact minimizes perturbation to base model token distribution
ATLAS design philosophy: five functional tokens suffice to abstract common visual operations without excessive disruption.
Token-level supervision enables models to learn functional-token invocation from reasoning context
ATLAS authors' assertion that functional tokens optimized via standard cross-entropy loss learn when and how to invoke operations from surrounding text.
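A small sketch of what token-level supervision means here, under stated assumptions: the four-word vocabulary, the `<Line>` spelling, and the logits are made up for illustration. The point is only that a functional token is scored by the same next-token cross-entropy as any other vocabulary item.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


# Toy vocabulary: ids 0-2 are ordinary words, id 3 is a functional token.
vocab = ["the", "line", "crosses", "<Line>"]
# Hypothetical logits at one position whose gold next token is "<Line>".
logits = [0.1, 0.3, -0.2, 2.0]

loss_functional = cross_entropy(logits, vocab.index("<Line>"))
loss_ordinary = cross_entropy(logits, vocab.index("the"))
print(loss_functional < loss_ordinary)
```

No special loss branch is needed for the functional token. The surrounding context shapes the logits, and the standard objective rewards placing probability mass on the functional token where the training data uses it.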

Hypotheses (1)

Five functional tokens can generalize across 40+ diverse visual reasoning tasks
ATLAS hypothesis that a compact set of high-level functional tokens (Manip, Shape, Line, Arrow, Text) suffices for multi-domain visual reasoning.
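To show what a compact functional-token vocabulary could look like in practice, the sketch below appends five entries to a toy word-level vocabulary. The token names come from the summary above. The `<...>` spelling, the whitespace tokenizer, and the helper names `extend_vocab` and `encode` are assumptions for illustration, not the paper's tokenizer.

```python
from typing import Dict, List

# Names taken from the summary; the "<...>" spelling is an assumption.
FUNCTIONAL_TOKENS = ["<Manip>", "<Shape>", "<Line>", "<Arrow>", "<Text>"]


def extend_vocab(vocab: Dict[str, int], new_tokens: List[str]) -> Dict[str, int]:
    """Return a copy of `vocab` with new tokens appended at fresh ids."""
    extended = dict(vocab)
    for token in new_tokens:
        if token not in extended:
            extended[token] = len(extended)
    return extended


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Whitespace tokenizer; unknown words map to the <unk> id."""
    return [vocab.get(piece, vocab["<unk>"]) for piece in text.split()]


base = {"<unk>": 0, "connect": 1, "A": 2, "to": 3, "B": 4}
vocab = extend_vocab(base, FUNCTIONAL_TOKENS)
ids = encode("connect A to B <Line>", vocab)
print(len(vocab) - len(base), ids)
```

Existing ids are untouched and only five new rows are added. That matches the stated design goal of keeping perturbation to the base model's token distribution small.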

Questions (1)

How can visual reasoning be preserved within discrete autoregressive sequences without external tools or pixel-level supervision?
Core research question addressed by ATLAS: bridging interpretability of agentic methods, efficiency of discrete tokens, and scalability of autoregressive training.

Original abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
