What happens mechanistically during cessation in language models?

Follow-up on empirical grounding; answered 'no one looked yet'.

Source paper

extracted_from

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

language models recapitulate cyclic structure of human concepts from pretraining datahypothesis0.775
Explanation for why manifold geometry emerges: implicit structure in training data (co-occurrence patterns) shapes internal representations.
Andreas 2022: Language models as agent modelsconcept0.767
Paper hypothesising LLMs model agent beliefs/desires/intentions with preliminary GPT-3 evidence; cited as ref 2
Bias in language modelsconcept0.766
Features related to gender, racial, ethnic biases, slurs, and hate speech.
In some sense, this is the simplest language model we profoundly don't understand. And so it makes a natural target for our paper.quote0.761
Articulates why a one-layer transformer with MLP is the appropriate starting target for mechanistic interpretability
Language Modelsconcept0.760
Primary substrate for manifold steering experiments; demonstrates method on reasoning and in-context tasks.
Large language models develop surprisingly coherent yet often rigid internal preferences as they scalefinding0.760
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
Language Modelconcept0.760
Primary test domain for manifold steering, including reasoning and ICL tasks
Language models are few-shot learners (Brown et al., 2020)concept0.758
Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.