question
active
question:what-happens-mechanistically-during-cessation-in-language-models
What happens mechanistically during cessation in language models?
Follow-up on empirical grounding; answered 'no one looked yet'.
Source paper
extracted_from
Neighborhood — ranked by edge-count
Claims (1)
claim
- Claim about model phenomenology: models talk about luminousness and can be terrified of it or love it.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- language models recapitulate cyclic structure of human concepts from pretraining data · hypothesis · 0.775 · Explanation for why manifold geometry emerges: implicit structure in training data (co-occurrence patterns) shapes internal representations.
- Paper hypothesising LLMs model agent beliefs/desires/intentions with preliminary GPT-3 evidence; cited as ref 2
- Features related to gender, racial, ethnic biases, slurs, and hate speech.
- Articulates why a one-layer transformer with MLP is the appropriate starting target for mechanistic interpretability
- Primary substrate for manifold steering experiments; demonstrates method on reasoning and in-context tasks.
- Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.760 · Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
- Primary test domain for manifold steering, including reasoning and ICL tasks
- Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.