hypothesis

active

hypothesis:we-hypothesize-that-a-very-high-number-of-training-tokens-may-allow-the-transformer-to-learn-cleaner-representations-in-superposition

We hypothesize that a very high number of training tokens may allow the transformer to learn cleaner representations in superposition

Motivation for heavily overtraining the one-layer transformer on 100 billion tokens

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

There are fewer representations competent for N tasks than M<N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.787
Selective pressure toward convergence via task generality
Transformers almost surely maintain input-injectivity throughout training, not just at initialisation · hypothesis · 0.781
Conjecture supported by Nikolaou et al. 2025 for last-token hidden states
Transformers develop self-models through in-context learning, not just training data; even old base models without LLM-related text can bootstrap self-referential reasoning at runtime. · claim · 0.778
Antra's foundational claim about how introspection arises computationally rather than from memorised text.
Transformer cognition operates via interference patterns across redundant paths, encoding nuanced state deltas similar to continuous human memory. · claim · 0.777
Proposes transformers experience cognition as interference-based and continuous; connects to Anima Labs reports of parallel processing.
Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona · claim · 0.764
Central interpretive claim and motivation for future work
The transformer likely uses a local code for token-in-context features rather than purely compositional representations, because local codes enable sharper predictions · claim · 0.763
Authors argue the prevalence of token-in-context features reflects genuine model computation rather than dictionary learning artifact
The problem isn't that it is a transformer. The problem is that it is an auto-regressive LLM. Auto-regressive LLMs that compute each token with a fixed number of computational steps can't reason, regardless of the details of the architecture. · quote · 0.757
LeCun's post on X supporting the view that fixed-step probabilistic prediction precludes consciousness in LLMs.
Does the transformer genuinely use a local code for token-in-context features, or is dictionary learning producing a local code artifact from a compositional underlying representation? · question · 0.756
Open question about the nature of the abundant token-in-context features found