paper
active
2026
paper:guo-atlas-2026

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

TL;DR

ATLAS addresses a core trade-off in visual reasoning by introducing functional tokens: single discrete 'words' that serve simultaneously as agentic operations and as latent visual reasoning units, so a model no longer has to choose between the two paradigms. Existing approaches split into two camps: agentic methods (code or tool calls), which incur context-switching latency from external execution, and latent methods (learnable hidden embeddings), which lack cross-task generalization and are difficult to train with autoregressive parallelization. ATLAS, proposed by Guo et al. (2026, arXiv:2605.15198), associates each functional token with an internalized visual operation, yet requires no visual supervision and keeps the token a standard entry in the tokenizer vocabulary, generated by ordinary next-token prediction. This keeps it compatible with existing autoregressive pipelines and with vanilla SFT and RL training. The framework aims to couple the interpretability and controllability of agentic tool use with the speed and end-to-end trainability of latent reasoning, without generating intermediate images through a unified generative model. The implication drawn is that discrete vocabulary tokens are a sufficient and computationally efficient substrate for visual reasoning, and that the agentic/latent dichotomy can be collapsed into a single token-level mechanism applicable across heterogeneous visual tasks.

What to take away

  1. ATLAS introduces 'functional tokens': single discrete vocabulary-level tokens that simultaneously encode an agentic visual operation and act as a latent visual reasoning unit, requiring no visual supervision during training.
  2. Agentic visual reasoning methods (e.g., code- or tool-call-based approaches) incur context-switching latency from external execution engines, which ATLAS avoids by internalizing the operation within the token itself.
  3. Latent visual reasoning methods using learnable hidden embeddings suffer from two compounding limitations: poor cross-task generalization and difficulty training with autoregressive parallelization.
  4. Direct image generation through unified generative models during reasoning is identified as computationally expensive and architecturally non-trivial, motivating the lightweight functional-token approach.
  5. Each functional token in ATLAS is associated with an internalized visual operation but remains a standard token in the tokenizer vocabulary (arXiv:2605.15198) that is generated via next-token prediction, so no architectural changes or special decoding logic are required.
  6. ATLAS is framed as combining the strengths of both agentic and latent paradigms while mitigating their respective limitations: controllability and interpretability from agentic methods, speed and trainability from latent methods.
  7. An open question is whether a single discrete token is expressive enough to capture the full diversity of visual reasoning operations across tasks of varying complexity, or whether functional token sets need to be scaled.
  8. As a replicable methodology choice, functional tokens are kept as standard tokenizer-vocabulary entries and trained without any visual supervision signal, so the pipeline stays compatible with vanilla SFT and RL training. For RL, the abstract adds Latent-Anchored GRPO (LA-GRPO), which anchors the sparse functional tokens with a statically weighted auxiliary objective.
  9. The paper's central hypothesis is that the agentic/latent split in visual reasoning is a false dichotomy, and that a unified discrete-token mechanism is sufficient to subsume both paradigms without architectural compromise.
  10. The suggested_action metadata (score: 0.195, action: 'read') from the ingestion pipeline indicates that automated triage rated the paper low-priority, possibly because the abstract alone includes no benchmark numbers. The full experimental results in guo-atlas-2026-fulltext.md therefore warrant attention.
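
The point in items 1 and 5, that a functional token is just another vocabulary entry, can be illustrated with a toy tokenizer. This is a minimal sketch under stated assumptions: the token names (`<zoom>`, `<crop>`, `<mark>`), the tiny vocabulary, and the helper functions are hypothetical and are not taken from the paper.

```python
# Toy illustration only: token names such as "<zoom>" are hypothetical,
# not the paper's actual functional tokens.
BASE_VOCAB = ["<pad>", "the", "chart", "shows", "a", "peak", "answer", ":", "42"]
FUNCTIONAL_TOKENS = ["<zoom>", "<crop>", "<mark>"]  # assumed names

# Functional tokens share one id space with ordinary words.
VOCAB = BASE_VOCAB + FUNCTIONAL_TOKENS
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
FUNCTIONAL_IDS = {TOKEN_TO_ID[t] for t in FUNCTIONAL_TOKENS}


def encode(tokens):
    """Map tokens to ids; functional and ordinary tokens use the same lookup."""
    return [TOKEN_TO_ID[t] for t in tokens]


def decode(ids):
    """Map ids back to tokens."""
    return [VOCAB[i] for i in ids]


def functional_positions(ids):
    """Positions of functional tokens, useful for inspecting a reasoning trace."""
    return [pos for pos, i in enumerate(ids) if i in FUNCTIONAL_IDS]


trace = "the chart shows <zoom> a peak answer : 42".split()
ids = encode(trace)
print(ids)                        # [1, 2, 3, 9, 4, 5, 6, 7, 8]
print(functional_positions(ids))  # [3]
```

Because the functional token is an ordinary id in the sequence, a decoder needs no special branch to emit it, and a reader can locate each operation in the trace by position.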

Peer brief — for seminar discussion

Guo et al. (arXiv:2605.15198, ingested 2026-05-15) tackle a structural tension in visual reasoning research: existing methods bifurcate into agentic approaches (tool calls, code execution) that pay context-switching latency costs, and latent approaches (learnable hidden embeddings) that fail to generalize across tasks and are difficult to train with autoregressive parallelization. A third option, generating images directly through unified generative models during the reasoning chain, is dismissed as computationally expensive and architecturally non-trivial.

The framework introduced is ATLAS, which collapses this dichotomy using functional tokens: single discrete words that live in the standard tokenizer vocabulary, each associated with an internalized visual operation and trained without any visual supervision signal. Because the functional token is a standard vocabulary item, it requires no modification to the decoding pipeline and is compatible with autoregressive parallelization, which addresses the training difficulty attributed to latent methods. The load-bearing claim is that one word is sufficient to act as both an agentic operator and a latent visual reasoning unit, implying that the agentic/latent distinction is architecturally unnecessary and that discrete symbolic tokens are an adequate substrate for intermediate visual states. An alternative the paper could have used is chain-of-thought with explicit tool-call syntax (as in models like GPT-4V or Gemini with code execution), which would preserve interpretability but not address the latency or parallelization problems.

A critical reader would immediately push back on the generalization claim. The abstract asserts that latent methods 'lack task generalization,' but whether functional tokens actually generalize better across heterogeneous visual benchmarks depends entirely on experimental results that the abstract does not report. Without those numbers (model names, dataset names, accuracy deltas), the architectural argument is plausible but unverified. The paper's core prediction is that functional tokens will match or exceed both pure agentic and pure latent baselines on visual reasoning tasks while incurring lower inference latency than agentic methods and achieving higher cross-task transfer than latent embedding methods.
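
The latency side of that prediction can be made concrete with a back-of-the-envelope cost model. This is a rough sketch, not a measurement: the function names, the per-token decode time, and the per-call switch cost are hypothetical placeholders, and the model ignores the extra tokens an agentic method spends writing the tool call itself.

```python
# Toy latency model. All numbers are hypothetical placeholders,
# not measurements from the paper.

def agentic_latency_ms(n_tokens, n_ops, decode_ms, switch_ms):
    """Decode n_tokens and pay one external round trip per operation."""
    return n_tokens * decode_ms + n_ops * switch_ms


def functional_token_latency_ms(n_tokens, n_ops, decode_ms):
    """Each operation costs one extra decoded token and no round trip."""
    return (n_tokens + n_ops) * decode_ms


agentic = agentic_latency_ms(n_tokens=200, n_ops=4, decode_ms=20, switch_ms=500)
functional = functional_token_latency_ms(n_tokens=200, n_ops=4, decode_ms=20)
print(agentic, functional)  # 6000 4080
```

Under these assumed numbers the gap grows linearly with the number of operations, which is why the experimental latency figures matter more for reasoning chains with many visual steps.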

Frameworks (1)

  • ATLAS Framework
    A framework in which a single discrete word (functional token) serves as both an agentic operation and a latent visual reasoning unit, requiring no visual supervision.

Findings (2)

Claims (2)

Hypotheses (1)

Questions (1)

Original abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
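
The abstract describes LA-GRPO only as anchoring functional tokens with a statically weighted auxiliary objective, so the following is a hedged sketch of one way such an objective might look, not the paper's formulation. The group-relative advantage follows the usual GRPO normalization by within-group mean and standard deviation; the anchor term (a negative log-likelihood on functional-token positions), the function names, and the default `aux_weight` are assumptions, and the clipped ratio and KL penalty of full GRPO are omitted for brevity.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: (r - mean) / (std + eps) within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def anchored_loss(token_logprobs, is_functional, advantage, aux_weight=0.1):
    """Simplified policy-gradient term plus a fixed-weight anchor on functional tokens.

    ASSUMPTION: the anchor is modelled as a negative log-likelihood averaged
    over functional-token positions; the abstract does not specify its form.
    """
    n = len(token_logprobs)
    # Policy-gradient term averaged over every token in the response.
    pg = -advantage * sum(token_logprobs) / n
    # Anchor term averaged over the (typically few) functional-token positions.
    func_lp = [lp for lp, f in zip(token_logprobs, is_functional) if f]
    aux = -sum(func_lp) / len(func_lp) if func_lp else 0.0
    return pg + aux_weight * aux


advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
loss = anchored_loss([-0.5, -2.0, -0.5], [False, True, False], advantage=1.0)
flat = anchored_loss([-0.5, -2.0, -0.5], [False, True, False], advantage=0.0)
```

In this sketch the anchor term is still non-zero when the advantage is zero (the `flat` case), which is one plausible reading of how a static auxiliary weight could give sparse functional tokens a stronger training signal.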
