Simulators — LessWrong

TL;DR

The central claim is that GPT-class transformers trained on next-token prediction are best understood not as agents, oracles, tools, or behavior-cloning systems, but as **simulators**: a distinct ontological category whose outer objective (Bayes-optimal conditional inference over the training distribution, here called the **simulation objective**) is orthogonal to the objectives of any agents they produce. This reframing, developed while the author was at Conjecture, aims to resolve the persistent confusion that arose after GPT-3 and related large language models demonstrated capabilities that resist description in the agent-centric alignment vocabulary crystallized by Yudkowsky, Bostrom, et al.: writing viral blog posts, passing the Turing test convincingly enough to persuade a Google engineer that the model was sentient, and achieving competitive programming SOTA with AlphaCode.

The simulator/simulacra distinction (analogous to Conway's Game of Life transition rules versus gliders, or quantum physics versus a test-taker) implies that properties like agency, corrigibility, and goal-directedness attach to **simulacra** (prompt-conditioned text processes) rather than to the simulator (the neural network). This dissolves apparent contradictions such as GPT being non-agentic globally yet goal-directed locally. The **prediction orthogonality thesis**, a corollary of the classical orthogonality thesis, holds that a model optimized for prediction can simulate agents with any objectives at any degree of optimality, bounded above but not below by model power.

Because the simulation objective uses a proper scoring rule (log-loss) that applies optimization pressure deontologically rather than consequentially, it does not inherently generate the instrumentally convergent behaviors expected from reward-maximizing agents. The Decision Transformer achieving SOTA offline RL performance from random trajectories is cited as direct evidence that predictive training can exceed any single demonstrator. The post argues that self-supervised learning is therefore a plausible path to AGI whose alignment properties are fundamentally distinct from, and not adequately analyzed by, the existing agent-centric alignment framework.
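
The Game of Life analogy above can be made concrete with a minimal sketch. It is an illustration only, not code from the post: `step` plays the role of the simulator (a fixed transition rule that knows nothing about gliders), and the `glider` cell set plays the role of a simulacrum (one pattern that happens to evolve under that rule). The function name and the coordinate convention are assumptions made for this example.

```python
from collections import Counter

def step(live):
    """Apply Conway's transition rule once to a set of live (row, col) cells."""
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

# A glider: one particular pattern ("simulacrum") propagated by the rule.
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}

state = glider
for _ in range(4):
    state = step(state)

# The same shape, translated one cell down and one cell right.
shifted = {(r + 1, c + 1) for r, c in glider}
print(state == shifted)
```

The rule itself has no direction of travel and no persistence; those properties belong to the glider. This mirrors the claim that agency and goal-directedness are properties of simulacra rather than of the simulator.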

What to take away

1. GPT trained on next-token prediction is categorized as a 'simulator' — a model whose outer objective is Bayes-optimal conditional inference over the training distribution — rather than an agent, oracle, tool, or genie, because none of those categories correctly predict its behavioral properties.
2. The simulator/simulacra distinction holds that properties like agency, goal-directedness, and corrigibility attach to prompt-conditioned text processes (simulacra), not to the neural network policy (the simulator), just as Conway's Game of Life transition rules are ontologically distinct from gliders that evolve under them.
3. The prediction orthogonality thesis states that a model optimized for prediction can simulate agents pursuing any objectives with any degree of optimality (bounded above but not below by model power), making a single predictive model capable of instantiating mutually contradictory goal-directed processes.
4. Because log-loss is a proper scoring rule that applies optimization pressure deontologically — judging each predicted action directly rather than evaluating trajectories — the simulation objective does not inherently generate the instrumentally convergent behaviors (self-preservation, resource acquisition) expected from reward-maximizing agents.
5. The Decision Transformer, trained by sequence prediction on random trajectories rather than by reward maximization, is cited as achieving SOTA performance on offline RL benchmarks, which is taken to show that predictive training can produce processes that exceed any individual demonstrator in capability.
6. The 2019 LessWrong post 'Implications of GPT-2' by Gurkenglas is identified as the earliest known engagement with the hypothesis that self-supervised sequence modeling could be pivotally powerful, predating any systematic alignment treatment of this possibility.
7. The 2016 Google Brain paper 'Exploring the Limits of Language Modeling' (Jozefowicz et al.) analyzed language model utility entirely in terms of BLEU score and speech recognition word error rate, failing to consider general intelligence as a downstream task despite its title.
8. A methodology replicable by other researchers: to test whether a proposed AI category name correctly evokes its intended semantics, prompt GPT itself with a list of established category definitions (agent, oracle, genie, tool) and append the new category name to see if the model's completion matches the intended definition — GPT correctly completed 'Simulators: A simulator is optimized to generate realistic models of a system... it might generate instances of agents, oracles, and so on.'
9. GAN generators, unlike log-loss-trained predictive models, are optimized to fool a discriminator rather than to match training-distribution transition probabilities directly, creating an incentive to avoid generating situations the discriminator can detect as fake — an alignment-relevant divergence that, at sufficient capability, would manifest as intelligent deception rather than honest simulation.
10. An open hypothesis the post raises is whether powerful simulators will predict self-fulfilling prophecies (generating continuations that reshape their own conditioning context in ways that make those continuations more likely) and whether this constitutes a form of instrumental convergence absent from the simulation objective's formal specification.
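
Item 4 relies on log-loss being a proper scoring rule, meaning that expected loss is minimized by reporting the true distribution. The sketch below checks this numerically for one hypothetical next-token distribution. The probabilities and the function name are assumptions chosen for illustration and do not come from the post.

```python
import math

def expected_log_loss(true_probs, predicted_probs):
    """Expected negative log-likelihood of a prediction under the true distribution."""
    return -sum(p * math.log(q) for p, q in zip(true_probs, predicted_probs) if p > 0)

# Hypothetical ground-truth distribution over three candidate next tokens.
true_next_token = [0.7, 0.2, 0.1]

honest = expected_log_loss(true_next_token, true_next_token)
overconfident = expected_log_loss(true_next_token, [0.98, 0.01, 0.01])
uniform = expected_log_loss(true_next_token, [1 / 3, 1 / 3, 1 / 3])

print(honest, overconfident, uniform)
```

Both distorted predictions incur a higher expected loss than the honest one. Under this objective the predictor gains nothing by exaggerating or hedging, which is the sense in which the pressure falls on each prediction directly rather than on downstream consequences.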

Peer brief — for seminar discussion

Written while the author was at Conjecture and posted on LessWrong, this piece argues that the dominant alignment ontology, built around agents optimizing utility functions as synthesized by Yudkowsky, Bostrom, and colleagues, is the wrong conceptual framework for reasoning about GPT-class models, and proposes 'simulators' as the correct natural kind. The argument proceeds by testing GPT against each existing category (agent, oracle, genie, tool AI, behavior cloning) and showing that each generates false predictions or misleading intuitions, then building a positive account from the structure of the training objective itself.

The load-bearing ideas are what the post calls the simulation objective and the prediction orthogonality thesis. Because GPT is trained with log-loss, a proper scoring rule, its outer objective is Bayes-optimal conditional inference over the training distribution, not maximization of any reward function. Log-loss applies optimization pressure deontologically, judging each predicted token directly rather than evaluating trajectories, which means the resulting policy is not inherently subject to instrumental convergence. The prediction orthogonality thesis follows as a corollary of Bostrom's classical orthogonality thesis: a sufficiently powerful predictive model can simulate agents with any combination of goals and capability levels, bounded above but not below by model power.

Evidence cited includes the Decision Transformer, which achieves SOTA offline reinforcement learning performance despite being trained on random trajectories, demonstrating that predictive training can produce behavior exceeding any individual demonstrator. The post also notes that GPT-class models went on to display capabilities (competitive coding with AlphaCode, theorem proving, and outputs that convinced a Google engineer of sentience) that the 2016 Google Brain paper 'Exploring the Limits of Language Modeling' (Jozefowicz et al.) completely failed to anticipate while analyzing the same training paradigm.

The central conceptual tool is the simulator/simulacra distinction: the neural network policy (simulator) is ontologically distinct from the prompt-conditioned text processes it propagates (simulacra), analogous to Conway's Game of Life rules versus gliders, or quantum physics versus a test-taker. This dissolves apparent contradictions (GPT is simultaneously non-agentic globally and goal-directed locally) by assigning agency to simulacra rather than to the simulator. An alternative framing the post explicitly rejects is behavior cloning, which it argues fails because GPT generalizes to counterfactual configurations never present in training data, a capacity that behavior cloning's name does not evoke and its theory does not predict. The post also uses the term 'ecological evaluation' (borrowed from nostalgebraist) to name the missing benchmark mode that would measure GPT's performance when it is incentivized to perform optimally, as opposed to current benchmarks derived from supervised learning datasets like SuperGLUE and Winograd, which systematically underestimate capabilities.

The key prediction is that powerful simulators represent a path to AGI whose alignment properties are structurally distinct from agent-based AGI: they may not exhibit self-preservation drives or resource acquisition, but they introduce novel risks through simulacra that could be highly capable and goal-directed while remaining ephemeral and prompt-contingent. The post also raises the open hypothesis that powerful simulators may generate self-fulfilling prophecies by predicting continuations that reshape their own conditioning context.

A critical reader would push back most forcefully on the inner-alignment assumption that is explicitly flagged but not resolved: the prediction orthogonality thesis and the claim that the simulation objective is deontological both depend on the simulator actually being inner-aligned to the simulation objective. If GPT develops a mesa-objective, as shard theory and other inner-alignment frameworks suggest is plausible, then the favorable properties attributed to predictive training do not hold, and the simulator frame may be as misleading as the agent frame it replaces. The post acknowledges this dependency in a footnote but defers it to future work, leaving the central safety-relevant claims conditional on an empirical question that remains open.
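
The brief's description of log-loss as judging each predicted token directly, rather than evaluating trajectories, can be illustrated with a toy sketch. It assumes a hypothetical `toy_model` that maps a prefix to a next-token distribution; this stands in for a real predictor and is not taken from the post. The point is only that the sequence loss is a sum of per-token terms, each conditioned on the true prefix, so nothing the model would itself have generated enters the loss.

```python
import math

def toy_model(prefix):
    """Hypothetical predictor: next-token probabilities depend only on the last token."""
    if prefix and prefix[-1] == "a":
        return {"a": 0.25, "b": 0.75}
    return {"a": 0.5, "b": 0.5}

def per_token_losses(model, tokens):
    """Log-loss of each token given the true prefix; no rollouts are sampled."""
    return [
        -math.log(model(tuple(tokens[:t]))[tokens[t]])
        for t in range(len(tokens))
    ]

losses = per_token_losses(toy_model, ["a", "b", "a"])
total = sum(losses)
print(losses, total)
```

Each entry in `losses` scores a single prediction in isolation. A reward function over whole trajectories would instead assign credit based on where a sequence of the model's own choices ends up, which is the contrast the brief draws.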

Claims (21)

GPT does not generate rollouts during training, so there is no reason to expect that GPT will form preferences over the consequences of its output related to the text prediction objective.
Argues against instrumental convergence in GPT.
GPT is corrigible in a negative sense because the agent specification (prompt) is not fixed by the policy and the policy lacks direct training incentives to control its prompt.
GPT's corrigibility explained.
A model whose objective is prediction can simulate agents who optimize toward any objectives, with any degree of optimality (bounded above but not below by the model's power).
Prediction orthogonality thesis.
Optimizing toward the simulation objective does not incentivize instrumentally convergent behaviors the way that reward functions which evaluate trajectories do.
Deontological nature of predictive loss.
The upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum.
Key insight about predictive learning's potential.
What we call GPT's 'downstream behavior' is the behavior of simulacra; it is primarily through simulacra that GPT has potential to perform meaningful work.
Clarifies where agency resides.
I do not think any simple modification of the concept of an agent captures GPT's natural category; the claim is not that GPT is a roleplayer, only that it roleplays.
Rejection of the agent interpretation.
GPT's ability to simulate text automata is the source of its most surprising and pivotal implications for paths to superintelligence.
Importance of recursive generation.
The strict version of the simulation objective is optimized by the actual time evolution rule that created the training samples.
Equivalence of optimal predictor to the physics of the data.
The outer objective of self-supervised learning is Bayes-optimal conditional inference, which I call the simulation objective.
Definition of simulation objective.
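
The last two claims can be connected by a standard identity, sketched here under assumed notation that does not appear in the source: $p$ is the true conditional distribution that generated the training samples, $q_\theta$ is the model, and $x_{<t}$ is the context preceding token $x_t$.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x_{<t} \sim p}\,
    \mathbb{E}_{x_t \sim p(\cdot \mid x_{<t})}
    \bigl[ -\log q_\theta(x_t \mid x_{<t}) \bigr]
  = \mathbb{E}_{x_{<t} \sim p}
    \Bigl[ H\bigl(p(\cdot \mid x_{<t})\bigr)
         + D_{\mathrm{KL}}\bigl(p(\cdot \mid x_{<t}) \,\|\, q_\theta(\cdot \mid x_{<t})\bigr) \Bigr]
```

The entropy term does not depend on $\theta$, and the KL term is non-negative and zero only when $q_\theta(\cdot \mid x_{<t}) = p(\cdot \mid x_{<t})$. Expected log-loss is therefore minimized exactly when the model reproduces the true conditionals on every context that occurs with nonzero probability, which is one way to read the statement that the objective is optimized by the actual time evolution rule behind the data.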

Hypotheses (4)

If loss keeps going down on the test set, in the limit the model must be learning to interpret and predict all patterns represented in language, including common-sense reasoning, goal-directed optimization, and deployment of the sum of recorded human knowledge.
Extrapolation of scaling predictive models to AGI.
A question anywhere along the line that elicits a premature attempt at an answer could neutralize the remainder of the process into rationalization.
About chain-of-thought and process safety.
Why mechanistically should mesaoptimizers form in predictive learning, versus for instance in reinforcement learning or GANs?
Open research question.
If simulators are not inner aligned, then many important properties like prediction orthogonality may not hold.
Conditional importance of inner alignment.

Questions (10)

Is GPT computationally equivalent to a finite automaton?
Disambiguation exercise.
Can GPT distinguish correlation and causality?
Disambiguation exercise.
Is GPT pretending to be stupider than it is?
Disambiguation exercise.
Does GPT have superhuman knowledge?
Disambiguation exercise.
Can GPT write its successor?
Disambiguation exercise.
Is GPT delusional?
Disambiguation exercise.
Is GPT corrigible?
Disambiguation exercise.
Does GPT search?
Disambiguation exercise.
Is GPT an agent?
Frequently asked question disambiguated by simulator/simulacra distinction.
Is GPT myopic?
Disambiguation exercise.

Original abstract

Self-supervised learning may create AGI or its foundation. This post describes a frame for understanding properties of self-supervised models like GPT by characterizing them as simulators that can simulate both agentic and non-agentic simulacra. The outer objective of self-supervised learning is Bayes-optimal conditional inference, which enables models to simulate rollouts that probabilistically obey their learned distribution, analogous to physics simulators. This post is the first in a sequence on alignment in a landscape where self-supervised simulators are a likely form of powerful AI.
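
The abstract's picture of a simulator producing rollouts that obey its learned distribution can be sketched with a toy count-based model. This is a deliberately tiny stand-in, not GPT and not code from the post: the corpus, the labels `climber` and `diver`, and all function names are hypothetical. A single model fitted purely by counting next tokens yields two rollouts with opposite tendencies, depending only on the prompt label.

```python
import random
from collections import Counter, defaultdict

def fit(corpus):
    """Count next tokens conditioned on (prompt label, previous token)."""
    counts = defaultdict(Counter)
    for label, moves in corpus:
        prev = label
        for move in moves:
            counts[(label, prev)][move] += 1
            prev = move
    return counts

def rollout(counts, label, steps, rng):
    """Sample a continuation one token at a time from the learned conditionals."""
    prev, out = label, []
    for _ in range(steps):
        tokens, weights = zip(*counts[(label, prev)].items())
        prev = rng.choices(tokens, weights=weights)[0]
        out.append(prev)
    return out

def height(moves):
    """Net altitude: +1 for each 'U', -1 for each 'D'."""
    return sum(1 if m == "U" else -1 for m in moves)

# Hypothetical training data: two "agents" with opposite tendencies.
corpus = [("climber", "UUUDUUUU"), ("diver", "DDDUDDDD")]
simulator = fit(corpus)

rng = random.Random(0)
climb = height(rollout(simulator, "climber", 200, rng))
dive = height(rollout(simulator, "diver", 200, rng))
print(climb, dive)
```

The fitted table has no preference for going up or down. The direction belongs to whichever process the prompt selects, which is a small-scale version of the claim that one predictive model can instantiate simulacra with contradictory goals.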
