claim
active
claim:a-small-group-of-causally-implicated-hidden-states-encodes-llm-truth-representations-localized-over-clause-ending-punctuation-tokens
A small group of causally-implicated hidden states encodes LLM truth representations, localized over clause-ending punctuation tokens
Localization result from patching experiments; identifies the group (b) hidden states as the locus of truth representations.
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (2)
finding
- A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments · restates · supports · Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
- Contrasts with the 7B and 13B models, which show consistent summarization behavior; may complicate localization at 70B scale
Concepts (1)
concept
- Summarization Behavior · supports · The phenomenon where LLMs encode clause-level information over clause-ending punctuation tokens rather than the final content token
Methods (1)
method
- Residual Stream Patching · supports · Technique to localize causally implicated hidden states by swapping residual stream activations between a true and false input and measuring downstream log-probability changes
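The patching procedure described above can be illustrated with a minimal sketch. It does not use LLaMA-2 or the paper's code: the "model" is a hypothetical scalar residual stream with causal mixing, and `run_toy_model`, `patching_effects`, `N_LAYERS`, `MIX`, and the example inputs are all assumptions introduced for illustration. The direction of patching (copying activations from the true run into the false run) is likewise an assumption. The final-position value stands in for log P(TRUE) − log P(FALSE).

```python
from typing import List, Optional, Tuple

N_LAYERS = 3  # hypothetical depth of the toy model
MIX = 0.5     # hypothetical mixing weight, chosen only for illustration

Patch = Tuple[int, int, float]  # (layer, position, replacement value)


def run_toy_model(embeds: List[float], patch: Optional[Patch] = None):
    """Run the toy model and return (cache, logit_diff).

    cache[l][t] is the residual value at position t entering layer l;
    cache[N_LAYERS] is the final residual stream. logit_diff is read
    from the last position, as a stand-in for log P(TRUE) - log P(FALSE).
    """
    resid = list(embeds)
    cache = []
    for layer in range(N_LAYERS + 1):
        # Overwrite one residual entry if a patch targets this layer.
        if patch is not None and patch[0] == layer:
            resid[patch[1]] = patch[2]
        cache.append(list(resid))
        if layer < N_LAYERS:
            # Causal mixing: each position adds the mean of itself and
            # all earlier positions, so information only flows rightward.
            resid = [r + MIX * sum(resid[: t + 1]) / (t + 1)
                     for t, r in enumerate(resid)]
    return cache, cache[-1][-1]


def patching_effects(clean: List[float], corrupt: List[float]):
    """Return grid[layer][pos]: change in logit_diff when the corrupt run
    has one residual entry replaced by the clean run's value."""
    if len(clean) != len(corrupt):
        raise ValueError("inputs must have the same number of positions")
    clean_cache, _ = run_toy_model(clean)
    _, baseline = run_toy_model(corrupt)
    grid = []
    for layer in range(N_LAYERS + 1):
        row = []
        for pos in range(len(corrupt)):
            swap = (layer, pos, clean_cache[layer][pos])
            _, patched = run_toy_model(corrupt, swap)
            row.append(patched - baseline)
        grid.append(row)
    return grid


# Hypothetical inputs that differ only at position 2.
TRUE_RUN = [1.0, 0.0, 2.0, 0.0]
FALSE_RUN = [1.0, 0.0, -2.0, 0.0]

if __name__ == "__main__":
    for layer, row in enumerate(patching_effects(TRUE_RUN, FALSE_RUN)):
        print(layer, [round(x, 3) for x in row])
```

In this toy setting, positions before the differing token show no effect at any layer, while the effect is largest at the differing token in the first layer and at the final position in the last layer. With a real transformer, the same loop would run over cached residual-stream activations rather than scalars.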
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- We hypothesize that group (b) hidden states store a representation of the statement's truth · hypothesis · 0.812 · Motivating hypothesis driving the remainder of the paper's analysis after patching localization
- Localizes truth representations to specific hidden states, motivating the rest of the analysis
- Little evidence of steganography in NLAs; meaning-preserving transformations cause only small drops in FVE · finding · 0.778 · Quantitative evaluation showing NLAs do not heavily rely on covert encoding beyond overt language.
- Establishes that the observed linear structure is not merely a representation of text probability
- Do LLMs have a unified representation of truth that spans structurally and topically diverse data? · question · 0.768 · Central research question driving dataset design and experimental approach
- Central empirical conclusion of the paper about the fundamental limits of truth directions.
- The inferential interpretation of internal dynamics.
- Question raised by Anthropic and partially addressed by this paper's persistence evidence
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.