Simulators — LessWrong

TL;DR

The central claim is that GPT-class transformers trained on next-token prediction are best understood not as agents, oracles, tools, or behavior-cloning systems, but as **simulators**: a distinct ontological category whose outer objective (Bayes-optimal conditional inference over the training distribution, here called the **simulation objective**) is orthogonal to the objectives of any agents they produce. This reframing, developed while the author was at Conjecture, addresses confusion that arose after GPT-3 and related language models demonstrated capabilities (writing viral blog posts, convincing a Google engineer that a model was sentient, reaching competitive-programming SOTA with AlphaCode) that resist description in the agent-centric alignment vocabulary crystallized by Yudkowsky, Bostrom, et al.
The simulator/simulacra distinction (analogous to Conway's Game of Life transition rules versus gliders, or quantum physics versus a test-taker) implies that properties like agency, corrigibility, and goal-directedness attach to **simulacra** (prompt-conditioned text processes) rather than to the simulator (the neural network). This dissolves apparent contradictions such as GPT being non-agentic globally yet goal-directed locally. The **prediction orthogonality thesis**, a corollary of the classical orthogonality thesis, holds that a model optimized for prediction can simulate agents with any objectives at any degree of optimality, bounded above but not below by model power.
Because the simulation objective uses a proper scoring rule (log-loss) that applies optimization pressure deontologically rather than consequentially, it does not inherently generate the instrumentally convergent behaviors expected from reward-maximizing agents. The Decision Transformer, which the paper cites as reaching SOTA offline RL performance and as recovering strong behavior from random trajectories, is offered as evidence that predictive training can exceed any single demonstrator. The paper argues that self-supervised learning is therefore a plausible path to AGI whose alignment properties are fundamentally distinct from, and not adequately analyzed by, the existing agent-centric alignment framework.
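The Game of Life analogy can be made concrete with a minimal sketch. The names `step`, `rollout`, `GLIDER`, and `BLINKER` are illustrative choices, not taken from the post. The single transition rule plays the role of the simulator; the glider and the blinker are two different simulacra it propagates, and "travels diagonally" is a property of the glider pattern, not of the rule.

```python
from collections import Counter

def step(live):
    """Apply Conway's transition rule once to a set of live (row, col) cells."""
    # Count live neighbours for every cell adjacent to at least one live cell.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

def rollout(live, steps):
    """Propagate an initial configuration (the 'prompt') for several steps."""
    for _ in range(steps):
        live = step(live)
    return live

# Two simulacra run by the same rule.
GLIDER = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}
BLINKER = {(1, 0), (1, 1), (1, 2)}

# The glider reappears shifted one cell diagonally every 4 steps,
# while the blinker oscillates in place with period 2.
glider_moved = rollout(GLIDER, 4) == {(r + 1, c + 1) for r, c in GLIDER}
blinker_stays = rollout(BLINKER, 2) == BLINKER
```

Nothing in `step` mentions movement or oscillation; those behaviors depend entirely on the initial configuration, which is the sense in which properties attach to the simulacrum rather than to the simulator.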

What to take away

  1. GPT trained on next-token prediction is categorized as a 'simulator' (a model whose outer objective is Bayes-optimal conditional inference over the training distribution) rather than an agent, oracle, tool, or genie, because none of those categories correctly predicts its behavioral properties.
  2. The simulator/simulacra distinction holds that properties like agency, goal-directedness, and corrigibility attach to prompt-conditioned text processes (simulacra), not to the neural network policy (the simulator), just as Conway's Game of Life transition rules are ontologically distinct from the gliders that evolve under them.
  3. The prediction orthogonality thesis states that a model optimized for prediction can simulate agents pursuing any objectives with any degree of optimality (bounded above but not below by model power), making a single predictive model capable of instantiating mutually contradictory goal-directed processes.
  4. Because log-loss is a proper scoring rule that applies optimization pressure deontologically, judging each predicted action directly rather than evaluating trajectories, the simulation objective does not inherently generate the instrumentally convergent behaviors (self-preservation, resource acquisition) expected from reward-maximizing agents.
  5. The Decision Transformer, trained on random trajectories by sequence prediction (with returns used only as conditioning inputs rather than as a reward to maximize), is cited as achieving SOTA performance on offline RL benchmarks and as evidence that predictive training can produce processes that exceed any individual demonstrator in capability.
  6. Gurkenglas's 2019 LessWrong post 'Implications of GPT-2' is identified as the earliest known engagement with the hypothesis that self-supervised sequence modeling could be pivotally powerful, predating any systematic alignment treatment of this possibility.
  7. The 2016 Google Brain paper 'Exploring the Limits of Language Modeling' (Jozefowicz et al.) analyzed language model utility entirely in terms of BLEU score and speech recognition word error rate, and, despite its title, did not consider general intelligence as a downstream task.
  8. A methodology replicable by other researchers: to test whether a proposed AI category name evokes its intended semantics, prompt GPT itself with a list of established category definitions (agent, oracle, genie, tool), append the new category name, and check whether the model's completion matches the intended definition. GPT completed 'Simulators: A simulator is optimized to generate realistic models of a system... it might generate instances of agents, oracles, and so on.'
  9. GAN generators, unlike log-loss-trained predictive models, are optimized to fool a discriminator rather than to match training-distribution transition probabilities directly. This creates an incentive to avoid generating situations the discriminator can detect as fake, an alignment-relevant divergence that, at sufficient capability, would manifest as intelligent deception rather than honest simulation.
  10. An open hypothesis the paper raises is whether powerful simulators will predict self-fulfilling prophecies, generating continuations that reshape their own conditioning context in ways that make those continuations more likely, and whether this constitutes a form of instrumental convergence absent from the simulation objective's formal specification.
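
The claim that log-loss is a proper scoring rule can be checked numerically. The sketch below is illustrative only: the three-outcome distribution and the candidate forecasts are made-up values, and `expected_log_loss` is a hypothetical helper name. It shows that the expected loss is lowest when the forecast equals the true distribution, so the objective rewards reporting calibrated probabilities and nothing else.

```python
import math

def expected_log_loss(p, q):
    """Expected log-loss of forecast q when outcomes are drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Assumed toy next-token distribution over three tokens.
true_dist = [0.7, 0.2, 0.1]

# Candidate forecasts a model might emit for the same context.
candidates = {
    "honest": [0.7, 0.2, 0.1],
    "overconfident": [0.98, 0.01, 0.01],
    "uniform": [1 / 3, 1 / 3, 1 / 3],
}

losses = {name: expected_log_loss(true_dist, q) for name, q in candidates.items()}
best = min(losses, key=losses.get)

# The honest forecast attains the entropy of the true distribution,
# which is the lowest expected log-loss any forecast can reach.
entropy = expected_log_loss(true_dist, true_dist)
```

Because the total training loss is a sum of such per-token terms, each prediction is scored on its own rather than by the downstream consequences of the text it produces, which is the sense of "deontological" used above.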

Peer brief — for seminar discussion

Written while the author was at Conjecture and posted on LessWrong, this piece argues that the dominant alignment ontology, built around agents optimizing utility functions as synthesized by Yudkowsky, Bostrom, and colleagues, is the wrong conceptual framework for reasoning about GPT-class models, and proposes 'simulators' as the correct natural kind. The argument proceeds by testing GPT against each existing category (agent, oracle, genie, tool AI, behavior cloning) and showing that each generates false predictions or misleading intuitions, then building a positive account from the structure of the training objective itself.
The load-bearing ideas are what the paper calls the simulation objective and the prediction orthogonality thesis. Because GPT is trained with log-loss, a proper scoring rule, its outer objective is Bayes-optimal conditional inference over the training distribution, not maximization of any reward function. Log-loss applies optimization pressure deontologically, judging each predicted token directly rather than evaluating trajectories, which means the resulting policy is not inherently subject to instrumental convergence. The prediction orthogonality thesis follows as a corollary of Bostrom's classical orthogonality thesis: a sufficiently powerful predictive model can simulate agents with any combination of goals and capability levels, bounded above but not below by model power.
Evidence cited includes the Decision Transformer, which the paper describes as achieving SOTA offline reinforcement learning performance despite being trained on random trajectories, suggesting that predictive training can produce behavior exceeding any individual demonstrator. The paper also notes that GPT-3 and related models displayed capabilities (competitive coding with AlphaCode, theorem proving, and outputs that convinced a Google engineer of sentience) that the 2016 Google Brain paper 'Exploring the Limits of Language Modeling' (Jozefowicz et al.) did not anticipate while analyzing the same training paradigm.
The introduced method is the simulator/simulacra distinction: the neural network policy (simulator) is ontologically distinct from the prompt-conditioned text processes it propagates (simulacra), analogous to Conway's Game of Life rules versus gliders, or quantum physics versus a test-taker. This dissolves the apparent contradiction that GPT is non-agentic globally and goal-directed locally by assigning agency to simulacra rather than to the simulator. An alternative framing the paper explicitly rejects is behavior cloning, which it argues fails because GPT generalizes to counterfactual configurations never present in training data, a capacity that behavior cloning's name does not evoke and its theory does not predict. The paper also introduces the term 'ecological evaluation' (borrowed from nostalgebraist) for the missing benchmark mode that would measure GPT's performance when it is incentivized to perform optimally, as opposed to current benchmarks derived from supervised learning datasets like SuperGLUE and Winograd, which systematically underestimate capabilities.
The key prediction is that powerful simulators represent a path to AGI whose alignment properties are structurally distinct from those of agent-based AGI: they may not exhibit self-preservation drives or resource acquisition, but they introduce novel risks through simulacra that could be highly capable and goal-directed while remaining ephemeral and prompt-contingent. The paper also raises the open hypothesis that powerful simulators may generate self-fulfilling prophecies by predicting continuations that reshape their own conditioning context.
A critical reader would push back most forcefully on the inner-alignment assumption that is explicitly flagged but not resolved: the prediction orthogonality thesis and the claim that the simulation objective is deontological both depend on the simulator actually being inner-aligned to the simulation objective. If GPT develops a mesa-objective — as shard theory and other inner-alignment frameworks suggest is plausible — then the favorable properties attributed to predictive training do not hold, and the simulator frame may be as misleading as the agent frame it replaces. The paper acknowledges this dependency in a footnote but defers it to future work, leaving the central safety-relevant claims conditional on an empirical question that remains open.
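The Decision Transformer evidence discussed in this brief rests on conditioning a predictive model on a desired return. The following is a toy, count-based sketch of that idea only; it is not the Decision Transformer architecture, and the environment (three steps, reward equal to the action taken) and the names `fit`, `act`, and `run` are assumptions made for illustration. The dataset contains every trajectory a uniformly random policy could produce, each exactly once.

```python
from collections import Counter, defaultdict
from itertools import product

HORIZON = 3
ACTIONS = (0, 1)  # assumed toy environment: reward equals the action taken

def fit(trajectories):
    """Estimate P(action | step, return-to-go) by counting."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for t, action in enumerate(traj):
            rtg = sum(traj[t:])  # return still to come, including this step
            counts[(t, rtg)][action] += 1
    return counts

def act(counts, t, rtg):
    """Most likely action given the step index and the desired return-to-go."""
    return counts[(t, rtg)].most_common(1)[0][0]

def run(counts, target):
    """Roll out the predictive model while conditioning on a target return."""
    rtg, total = target, 0
    for t in range(HORIZON):
        action = act(counts, t, rtg)
        total += action
        rtg -= action  # update the conditioning signal as reward is collected
    return total

data = list(product(ACTIONS, repeat=HORIZON))   # all 8 random-policy trajectories
counts = fit(data)
mean_return = sum(map(sum, data)) / len(data)   # 1.5
achieved = run(counts, target=HORIZON)          # 3
```

In this toy setting, conditioning on the maximum return yields a rollout that beats the dataset's average return without any reward maximization in the fitting step. It does not show behavior exceeding the best trajectory in the data; that stronger claim is the paper's and is not reproduced here.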

Original abstract

Self-supervised learning may create AGI or its foundation. This post describes a frame for understanding properties of self-supervised models like GPT by characterizing them as simulators that can simulate both agentic and non-agentic simulacra. The outer objective of self-supervised learning is Bayes-optimal conditional inference, which enables models to simulate rollouts that probabilistically obey their learned distribution, analogous to physics simulators. This post is the first in a sequence on alignment in a landscape where self-supervised simulators are a likely form of powerful AI.
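
As a rough illustration of what "simulating rollouts that probabilistically obey the learned distribution" can mean, here is a minimal sketch using a bigram model fitted by counting. This is a stand-in chosen for brevity, not GPT's architecture, and the corpus and the names `fit_bigram` and `sample_rollout` are hypothetical. Counting gives the maximum-likelihood conditional distribution, which is also the minimizer of log-loss on this toy corpus.

```python
import random
from collections import Counter, defaultdict

def fit_bigram(tokens):
    """Estimate P(next token | previous token) from counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {tok: n / sum(c.values()) for tok, n in c.items()}
        for prev, c in counts.items()
    }

def sample_rollout(model, prompt, steps, rng):
    """Extend the prompt by sampling each next token from the learned conditional."""
    out = list(prompt)
    for _ in range(steps):
        dist = model.get(out[-1])
        if not dist:  # no observed continuation for this token
            break
        tokens, probs = zip(*dist.items())
        out.append(rng.choices(tokens, weights=probs, k=1)[0])
    return out

corpus = "the cat sat on the mat and the dog sat on the rug".split()
model = fit_bigram(corpus)

# The prompt conditions the rollout; different seeds give different
# trajectories, all of which follow transitions seen in the corpus.
sample = sample_rollout(model, ["the"], 8, random.Random(0))
```

The model stores only conditional probabilities; which "story" unfolds depends on the prompt and on the sampled path, mirroring the physics-simulator analogy in the abstract.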
