finding
active
finding:a-small-group-of-hidden-states-group-b-over-end-of-sentence-punctuation-tokens-is-highly-causally-implicated-in-truth-judgments
A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments
Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Claims (1)
claim
- A small group of causally-implicated hidden states encodes LLM truth representations, localized over clause-ending punctuation tokens · restates · supports · Localization result from patching experiments; identifies group (b) hidden states as the locus of truth representations
Hypotheses (1)
hypothesis
- Motivating hypothesis driving the remainder of the paper's analysis after patching localization
Concepts (1)
concept
- Summarization Token Behavior · supports · Behavior where information about full clauses is encoded over clause-ending punctuation tokens in LLMs
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Localizes truth representations to specific hidden states, motivating the rest of the analysis
- Little evidence of steganography in NLAs; meaning-preserving transformations cause only small drops in FVE · finding · 0.769 · Quantitative evaluation showing NLAs do not heavily rely on covert encoding beyond overt language.
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- Shows absence of abstract truth representations in smallest model, supporting scale-dependent emergence claim
- Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
- Application to transformer language models
- Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.759 · Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
- Supported by the neutral read-prompt changing emergence but not fully replicating ask-correct cross-task generalization.
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.