finding

active

finding:a-small-group-of-hidden-states-group-b-over-end-of-sentence-punctuation-tokens-is-highly-causally-implicated-in-truth-judgments

A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments

Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
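
As a rough illustration of what a patching experiment measures, the sketch below runs a toy causal model on a "source" input and a "target" input, overwrites one hidden state at a time in the target run with its source counterpart, and records how far the output moves toward the source output. Everything here is an assumption made for illustration: the scalar hidden states, the `run_toy_model` and `patching_effects` helpers, and the normalization of the effect are hypothetical and are not the paper's model, prompts, or metric.

```python
from typing import Dict, List, Optional, Tuple

# (layer, position) -> value that overwrites the hidden state at that site
Patch = Dict[Tuple[int, int], float]


def run_toy_model(
    inputs: List[float], n_layers: int = 3, patch: Optional[Patch] = None
) -> Tuple[float, List[List[float]]]:
    """Toy stand-in for a causal LM: one scalar hidden state per (layer, position).

    Each layer adds half the mean of the previous layer's states at positions
    <= pos (a crude causal mixing step). The output is read from the last
    position of the final layer.
    """
    patch = patch or {}
    # Layer 0 holds the (possibly patched) inputs.
    hidden = [[patch.get((0, pos), x) for pos, x in enumerate(inputs)]]
    for layer in range(1, n_layers + 1):
        prev = hidden[-1]
        cur = []
        for pos in range(len(prev)):
            causal_mean = sum(prev[: pos + 1]) / (pos + 1)
            value = prev[pos] + 0.5 * causal_mean
            # A patch replaces the computed state at this site.
            cur.append(patch.get((layer, pos), value))
        hidden.append(cur)
    return hidden[-1][-1], hidden


def patching_effects(
    source: List[float], target: List[float], n_layers: int = 3
) -> Dict[Tuple[int, int], float]:
    """Patch each (layer, position) from the source run into the target run.

    Returns the fraction of the source/target output gap closed by each patch:
    0.0 means the site had no effect, 1.0 means the patch fully restored the
    source output.
    """
    source_out, source_hidden = run_toy_model(source, n_layers)
    target_out, _ = run_toy_model(target, n_layers)
    gap = source_out - target_out
    effects = {}
    for layer in range(n_layers + 1):
        for pos in range(len(source)):
            site = {(layer, pos): source_hidden[layer][pos]}
            patched_out, _ = run_toy_model(target, n_layers, site)
            effects[(layer, pos)] = (patched_out - target_out) / gap
    return effects


# The two inputs differ only at position 1.
effects = patching_effects([1.0, 2.0, 3.0], [1.0, 0.0, 3.0])
for site in sorted(effects):
    print(site, round(effects[site], 3))
```

In this toy setup, sites whose patched value cannot reach the output (for example, a non-final position at the last layer) score zero, while sites that carry the differing information score higher. A localization claim of the kind recorded here rests on the same logic: a small set of sites accounts for most of the measured effect.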

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (1)

claim

A small group of causally-implicated hidden states encodes LLM truth representations, localized over clause-ending punctuation tokens
restates · supports
Localization result from patching experiments; identifies group (b) hidden states as the locus of truth representations

Hypotheses (1)

hypothesis

We hypothesize that group (b) hidden states store a representation of the statement's truth
supports
Motivating hypothesis driving the remainder of the paper's analysis after patching localization

Concepts (1)

concept

Summarization Token Behavior
supports
Behavior where information about full clauses is encoded over clause-ending punctuation tokens in LLMs

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Patching group (b) hidden states (over clause-ending punctuation, early-middle layers) in LLaMA-2-13B produces the strongest causal effect on TRUE/FALSE output predictions · finding · 0.842
Localizes truth representations to specific hidden states, motivating the rest of the analysis
Little evidence of steganography in NLAs; meaning-preserving transformations cause only small drops in FVE · finding · 0.769
Quantitative evaluation showing NLAs do not heavily rely on covert encoding beyond overt language.
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.766
SAEs uncover safety-relevant representations that might be monitored or controlled.
In LLaMA-2-7B, PCA of larger_than+smaller_than shows statements clustering by surface-level characteristics (e.g., presence of token 'eighty') rather than truth value · finding · 0.764
Shows absence of abstract truth representations in smallest model, supporting scale-dependent emergence claim
Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior. · quote · 0.760
Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
Causally-masked attention in a decoder-only model has no ordered phase (Proposition 2) · finding · 0.760
Application to transformer language models
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.759
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
Truth-evaluation framing specifically contributes to truth geometry shifts beyond generic instruction-following prefix. · claim · 0.757
Supported by the neutral read-prompt changing emergence without fully replicating the ask-correct cross-task generalization.

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

claim
A small group of causally-implicated hidden states encodes LLM truth representations, localized over clause-ending punctuation tokens