claim

active

claim:a-small-group-of-causally-implicated-hidden-states-encodes-llm-truth-representations-localized-over-clause-ending-punctuation-tokens

A small group of causally-implicated hidden states encodes LLM truth representations, localized over clause-ending punctuation tokens

Localization result from patching experiments; identifies group (b) hidden states as the locus of truth representations

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Papers (1)

paper

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
introduces

Findings (2)

finding

A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments
restates · supports
Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
LLaMA-2-70B displays summarization behavior over punctuation tokens in a context-dependent way: present for cities but not for sp_en_trans
contradicts
Contrasts with 7B and 13B which show consistent summarization behavior; may complicate localization at 70B scale

Concepts (1)

concept

Summarization Behavior
supports
The phenomenon where LLMs encode clause-level information over clause-ending punctuation tokens rather than the final content token

Methods (1)

method

Residual Stream Patching
supports
Technique for localizing causally implicated hidden states: swap residual stream activations between a run on a true input and a run on a false input, then measure the resulting change in downstream log-probabilities
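
The patching procedure described above can be sketched on a toy model. This is a minimal illustration, not the paper's code and not LLaMA-2: the tiny causal residual network, its random weights, the token ids, and every name here (`make_toy_model`, `forward`, `patching_sweep`) are hypothetical. It also assumes one particular direction (activations cached from the true input are patched into the run on the false input) and scores each site by the change in log P(TRUE) - log P(FALSE).

```python
import math
import random

D_MODEL, N_LAYERS, VOCAB = 8, 3, 12  # toy sizes, chosen arbitrarily


def make_toy_model(seed=0):
    """Random weights for a toy causal residual network (a stand-in, not LLaMA-2)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "embed": mat(VOCAB, D_MODEL),
        "layers": [mat(D_MODEL, D_MODEL) for _ in range(N_LAYERS)],
        "unembed": mat(2, D_MODEL),  # row 0 scores "TRUE", row 1 scores "FALSE"
    }


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def forward(model, tokens, patch=None):
    """Return (log P(TRUE) - log P(FALSE), cache of residual states per layer).

    patch is an optional (layer, position, vector) triple: the residual stream
    at that site is overwritten before the layer reads it.
    """
    resid = [list(model["embed"][t]) for t in tokens]
    cache = []
    for layer, w in enumerate(model["layers"]):
        if patch is not None and patch[0] == layer:
            resid[patch[1]] = list(patch[2])
        cache.append([list(h) for h in resid])
        new_resid = []
        for pos in range(len(resid)):
            # Causal mixing: each position reads the mean of itself and earlier positions.
            ctx = [sum(resid[p][i] for p in range(pos + 1)) / (pos + 1)
                   for i in range(D_MODEL)]
            update = [math.tanh(x) for x in matvec(w, ctx)]
            new_resid.append([h + u for h, u in zip(resid[pos], update)])
        resid = new_resid
    logits = matvec(model["unembed"], resid[-1])
    # Under a softmax, the log-probability difference equals the logit difference.
    return logits[0] - logits[1], cache


def patching_sweep(model, true_tokens, false_tokens):
    """Patch one (layer, position) site at a time and record the effect."""
    if len(true_tokens) != len(false_tokens):
        raise ValueError("inputs must have the same number of tokens")
    base_diff, _ = forward(model, false_tokens)
    _, true_cache = forward(model, true_tokens)
    effects = {}
    for layer in range(N_LAYERS):
        for pos in range(len(false_tokens)):
            site = (layer, pos, true_cache[layer][pos])
            patched_diff, _ = forward(model, false_tokens, patch=site)
            effects[(layer, pos)] = patched_diff - base_diff
    return effects


# Hypothetical inputs that differ only at position 3.
model = make_toy_model(seed=0)
true_tokens = [1, 2, 3, 4, 5, 6]
false_tokens = [1, 2, 3, 7, 5, 6]
effects = patching_sweep(model, true_tokens, false_tokens)
top_site = max(effects, key=lambda site: abs(effects[site]))
print("largest effect at (layer, position):", top_site)
```

Sites with a large absolute effect are the ones the output depends on most. In this causal toy model, positions before the first differing token hold identical states in both runs, so patching them changes nothing; which later sites matter depends entirely on the random weights and says nothing about real models.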

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that group (b) hidden states store a representation of the statement's truth · hypothesis · 0.812
Motivating hypothesis driving the remainder of the paper's analysis after patching localization
Patching group (b) hidden states (over clause-ending punctuation, early-middle layers) in LLaMA-2-13B produces the strongest causal effect on TRUE/FALSE output predictions · finding · 0.808
Localizes truth representations to specific hidden states, motivating the rest of the analysis
Little evidence of steganography in NLAs; meaning-preserving transformations cause only small drops in FVE · finding · 0.778
Quantitative evaluation showing NLAs do not heavily rely on covert encoding beyond overt language.
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the likely dataset performing poorly on anti-correlated datasets · claim · 0.774
Establishes that the observed linear structure is not merely a representation of text probability
Do LLMs have a unified representation of truth that spans structurally and topically diverse data? · question · 0.768
Central research question driving dataset design and experimental approach
Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results · claim · 0.768
Central empirical conclusion of the paper about the fundamental limits of truth directions.
Internal states appear to encode Bayesian beliefs about hidden external states · claim · 0.765
The inferential interpretation of internal dynamics.
Are LLM emotion states encoded only selectively in token positions where they are operative, or in a more complex non-linear manner? · question · 0.763
Question raised by Anthropic and partially addressed by this paper's persistence evidence

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments