finding

active

finding:patching-group-b-hidden-states-over-clause-ending-punctuation-early-middle-layers-in-llama-2-13b-produces-the-strongest-causal-effect-on-true-false-output-predictions

Patching group (b) hidden states (over clause-ending punctuation, early-middle layers) in LLaMA-2-13B produces the strongest causal effect on TRUE/FALSE output predictions

Localizes truth representations to specific hidden states, motivating the rest of the analysis
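
As a rough illustration of the kind of patching sweep behind this finding, the sketch below uses a toy causal residual model in NumPy rather than LLaMA-2-13B. It caches hidden states from a "clean" run, copies one (layer, token) state at a time into a "corrupted" run, and records how far the scalar output moves. The names `run_toy_model` and `patching_effects`, the toy architecture, and the scalar readout (standing in for a TRUE/FALSE logit difference) are assumptions made for illustration and are not taken from the source paper.

```python
import numpy as np

def run_toy_model(x, weights, w_out, patch=None):
    """Run a toy causal residual model (hypothetical stand-in for an LLM).

    x: (T, d) token embeddings; weights: list of (d, d) matrices, one per layer.
    patch: optional (layer, token, vector); overwrites the hidden state that
    enters `layer` at position `token` before that layer is applied.
    Returns the scalar readout at the last token and the per-layer input cache.
    """
    h = x.copy()
    cache = []
    for layer, W in enumerate(weights):
        if patch is not None and patch[0] == layer:
            h[patch[1]] = patch[2]
        cache.append(h.copy())
        # Causal mixing: each position sees the mean of itself and earlier positions.
        prefix_mean = np.cumsum(h, axis=0) / np.arange(1, len(h) + 1)[:, None]
        h = h + np.tanh(prefix_mean @ W)
    return float(h[-1] @ w_out), cache

def patching_effects(x_clean, x_corrupt, weights, w_out):
    """Patch each clean (layer, token) state into the corrupted run, one at a time."""
    clean_out, clean_cache = run_toy_model(x_clean, weights, w_out)
    corrupt_out, _ = run_toy_model(x_corrupt, weights, w_out)
    n_layers, n_tokens = len(weights), len(x_clean)
    effects = np.zeros((n_layers, n_tokens))
    for layer in range(n_layers):
        for token in range(n_tokens):
            patched_out, _ = run_toy_model(
                x_corrupt, weights, w_out,
                patch=(layer, token, clean_cache[layer][token]),
            )
            # Effect = shift of the readout relative to the unpatched corrupted run.
            effects[layer, token] = patched_out - corrupt_out
    return clean_out, corrupt_out, effects

# Toy usage: the two inputs differ only at token 1.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(8, 8)) for _ in range(3)]
w_out = rng.normal(size=8)
x_clean = rng.normal(size=(4, 8))
x_corrupt = x_clean.copy()
x_corrupt[1] = rng.normal(size=8)
clean_out, corrupt_out, effects = patching_effects(x_clean, x_corrupt, weights, w_out)
print(np.round(effects, 3))  # rows: layers, columns: token positions
```

In a sweep like this, the (layer, token) cells with the largest absolute effect are the candidates for where the relevant information is stored. In the toy example, positions before the differing token show zero effect because the model is causal.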

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

We hypothesize that group (b) hidden states store a representation of the statement's truth
supports
Motivating hypothesis driving the remainder of the paper's analysis after patching localization

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments · finding · 0.842
Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
A small group of causally-implicated hidden states encodes LLM truth representations, localized over clause-ending punctuation tokens · claim · 0.808
Localization result from patching experiments; identifies group (b) hidden states as the locus of truth representations
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.792
Central interpretive claim of the paper supported by causal ablation and activation evidence
In early layers, LLaMA-2-13B represents a 'close association' feature that correlates with truth on cities but anti-correlates on neg_cities · claim · 0.787
Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers
Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior. · quote · 0.786
Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
In LLaMA-2-13B, salient linear structure in the top PCs rapidly emerges in early-middle layers, with this emergence occurring later for conjunctive statements than simple statements · finding · 0.785
Layer-wise emergence pattern supporting hierarchical development hypothesis
Layer 24 (indexed at 8) of LLaMA3.1-8B on Hinting satisfies Criteria 1 and 2 under both IIT 3.0 and IIT 4.0 (temporal permutation). · finding · 0.782
One of the most promising cases; approximately corresponds to the 2/3 layer of LLaMA3.1-8B.
Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline · finding · 0.781
Empirical demonstration that MDVP produces divergent representations in a real LLM