finding

active

finding:transformers-learn-in-context-by-gradient-descent-functioning-as-mesa-optimizers-that-learn-internal-models-in-real-time

Transformers learn in-context by gradient descent, functioning as mesa-optimizers that learn internal models in real time

Evidence that in-context learning is genuine optimization rather than mere pattern matching; relevant to extending the thesis from training to inference
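A minimal sketch of what "in-context learning by gradient descent" can mean, assuming the simplest setting: in-context linear regression with a scalar target, weights starting at zero, and a single attention layer without softmax. The source page gives no construction, so the function names, data, and learning rate below are illustrative assumptions. Under those assumptions, the prediction after one gradient step on the context examples is identical to a linear-attention readout whose keys are the context inputs, whose values are the context targets, and whose query is the new input.

```python
# Sketch only: one gradient-descent step on in-context linear regression
# written two ways. All names and numbers are illustrative assumptions.

def gd_step_prediction(xs, ys, x_query, lr):
    """Predict y for x_query after one GD step from w = 0 on
    L(w) = 1/(2N) * sum_i (w . x_i - y_i)^2."""
    n, d = len(xs), len(x_query)
    # The gradient at w = 0 is -(1/N) * sum_i y_i * x_i,
    # so the updated weights are w1 = (lr/N) * sum_i y_i * x_i.
    w1 = [lr / n * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(d)]
    return sum(wj * qj for wj, qj in zip(w1, x_query))

def linear_attention_prediction(xs, ys, x_query, lr):
    """The same prediction as attention without softmax:
    keys = x_i, values = y_i, query = x_query, scale = lr/N."""
    n = len(xs)
    # Unnormalised attention scores: dot product of each key with the query.
    scores = [sum(kj * qj for kj, qj in zip(x, x_query)) for x in xs]
    # Weighted sum of values, scaled by the learning rate over context size.
    return lr / n * sum(s * y for s, y in zip(scores, ys))

# Three context examples and one query (made-up data).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
x_query = [0.5, 2.0]

pred_gd = gd_step_prediction(xs, ys, x_query, lr=0.5)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr=0.5)
```

The two functions agree because the updated weight vector is a sum of value-weighted keys, so applying it to the query reduces to a sum of key-query dot products times values. This shows only the algebraic identity in a toy case; it says nothing about what trained transformers actually compute.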

Source paper

extracted_from

Why Learning Requires Feeling

(2026) · Cameron Berg

Neighborhood — ranked by edge-count

Thinkers (1)

thinker

Johannes von Oswald
introduces
Transformers learn in-context by gradient descent.

Claims (1)

claim

If in-context learning involves signed evaluation in the service of behavioral modification, then the thesis applies not only to training but to every inference-time interaction
supports
Extension of the thesis to deployed LLM inference via in-context learning

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Transformers develop self-models through in-context learning, not just training data; even old base models without LLM-related text can bootstrap self-referential reasoning at runtime. · claim · 0.845
Antra's foundational claim about how introspection arises computationally rather than from memorised text.
Transformers almost surely maintain input-injectivity throughout training, not just at initialisation · hypothesis · 0.811
Conjecture supported by Nikolaou et al. 2025 for last-token hidden states
In-Context Learning as Optimization · concept · 0.801
Transformers use an anti-Markovian solution that recomputes relevant numeric information at each step in the Multi-Object task · claim · 0.797
Prior finding from Grant et al. 2025 used to interpret low MAS IIA for GRU-Transformer hidden state comparisons.
Learning to encode position for transformer with continuous dynamical model (Liu et al., 2020) · concept · 0.793
Prior work on learned dynamic position encodings; cited alongside Wang et al. as precedent.
In-Context Learning of Representations (Park et al. 2025) · framework · 0.781
Reports phase-like breakpoints and geometry changes as context scales; UCCT provides a measurable predictor
Does the transformer genuinely use a local code for token-in-context features, or is dictionary learning producing a local code artifact from a compositional underlying representation? · question · 0.771
Open question about the nature of the abundant token-in-context features found
When a model discovers that its outputs produce effects, it accelerates learning through in-context learning, analogous to lucid dreaming. · claim · 0.767
Describes scaffolding method and the model's meta-learning loop.