question
active
Given a language model M and a statement s, does M believe s to be true?
The core motivating question of the paper, framed by Christiano et al. (2021)
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (1)
claim
- Establishes that the observed linear structure is not merely a representation of text probability
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
- Suggestive evidence for language-independent truth representation in LLMs
- Motivating hypothesis for Section 5's investigation of prompt template effects.
- Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
- All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.780 · In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
- What if the concept being manipulated does not lie on a straight line in the model's representations? · question · 0.780 · The motivating question that opens the paper and leads to the development of manifold steering.
- Core definitional quote for performative chain-of-thought
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.778 · Safety intervention that relies on activation modification, which ESR might undermine