ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

/Users/antonborzov/Documents/Research.nosync/papers/guo-atlas-2026.md

External IDs

arxiv: 2605.15198
title_hash: 8ae50ab6ef38de61236eb2d501e510b48f1848a3
legacy_slug: guo-atlas-2026
doi: 10.48550/arxiv.2605.15198

Frontmatter (18 fields)

{
  "doi": "10.48550/arxiv.2605.15198",
  "url": "arxiv:2605.15198",
  "title": "ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both",
  "authors": [
    "Ziyu Guo",
    "Rain Liu",
    "Xinyan Chen",
    "Pheng-Ann Heng"
  ],
  "abstract": "Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.",
  "arxiv_id": "2605.15198",
  "pdf_path": "/Users/antonborzov/Documents/Research.nosync/papers/guo-atlas-2026.pdf",
  "openalex_id": "W7161281398",
  "canonical_url": "arxiv:2605.15198",
  "fulltext_path": "/Users/antonborzov/Documents/Research.nosync/papers/guo-atlas-2026-fulltext.md",
  "ingest_status": "ok",
  "openalex_year": 2026,
  "triage_signals": {
    "gap_match": 0,
    "author_match": 0,
    "claim_impact": 0,
    "cohort_signal": 0,
    "recency_decay": 0.989,
    "priority_score": 0.195,
    "venue_priority": 0.65,
    "citation_density": 0,
    "concept_transfer": 0,
    "multica_priority": "low",
    "vector_relevance": 0.52,
    "prediction_evidence": 0
  },
  "suggested_action": "read",
  "openalex_enriched_at": 1778975778,
  "openalex_match_title": "ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both",
  "triage_priority_score": 0.195,
  "openalex_cited_by_count": 0
}

Outgoing (0)

None.

Incoming (4)

Authored by (4)

Pheng-Ann Heng (thinker)
Rain Liu (thinker)
Xinyan Chen (thinker)
Ziyu Guo (thinker)

References (30)

System card: Claude Opus 4 & Claude Sonnet 4
referenced-only
Multimodal chain-of-thought reasoning in language models
referenced-only
Dual-balancing for multi-task learning
referenced-only
DeepSeekMath: Pushing the limits of mathematical reasoning in open language models
referenced-only
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
referenced-only
Quiet-STaR: Language models can teach themselves to think before speaking
referenced-only
Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation
referenced-only
LLaVA-OneVision: Easy visual task transfer
referenced-only
Show-o: One single transformer to unify multimodal understanding and generation
referenced-only
Janus: Decoupling visual encoding for unified multimodal understanding and generation
referenced-only
Training large language models to reason in a continuous latent space
referenced-only
Imagine while reasoning in space: Multimodal visualization-of-thought
referenced-only
Efficient reasoning with hidden thinking
referenced-only
MME-CoT: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency
referenced-only
Qwen2.5-VL technical report
referenced-only
MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning
referenced-only
Gemma 3 technical report
referenced-only
SOTA with less: MCTS-guided sample selection for data-efficient visual reasoning self-improvement
referenced-only
Unified multimodal understanding and generation models: Advances, challenges, and opportunities
referenced-only
Flow-GRPO: Training flow matching models via online RL
referenced-only
DanceGRPO: Unleashing GRPO on visual generation
referenced-only
DeepEyes: Incentivizing "thinking with images" via reinforcement learning
referenced-only
Emerging properties in unified multimodal pretraining
referenced-only
Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning
referenced-only
Learning to reason without external rewards
referenced-only
MInT-CoT: Enabling interleaved visual tokens in mathematical chain-of-thought reasoning
referenced-only
Multi-step visual reasoning with visual tokens scaling and verification
referenced-only
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities
referenced-only
LLaVA-OneVision-1.5: Fully open framework for democratized multimodal training
referenced-only
Latent visual reasoning
referenced-only

Mentions (1)

papers
/Users/antonborzov/Documents/Research.nosync/papers/guo-atlas-2026-fulltext.md
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models dur