hypothesis

active

hypothesis:we-hypothesize-that-group-b-hidden-states-store-a-representation-of-the-statement-s-truth

We hypothesize that group (b) hidden states store a representation of the statement's truth

Motivating hypothesis that drives the remainder of the paper's analysis, once the patching experiments have localized the relevant hidden states

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Papers (1)

paper

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
introduces

Findings (3)

finding

A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments
supports
Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
Mass-mean (MM) probe directions outperform logistic regression (LR) and CCS directions in causal intervention experiments, as measured by normalized indirect effect (NIE), in 7/8 experimental conditions
associated_with
Core result showing MM is superior to LR for causal implication despite similar classification accuracy
Patching group (b) hidden states (over clause-ending punctuation, early-middle layers) in LLaMA-2-13B produces the strongest causal effect on TRUE/FALSE output predictions
supports
Localizes truth representations to specific hidden states, motivating the rest of the analysis

Methods (1)

method

Causal Intervention via Activation Shifting
supports
Shifts hidden-state activations along probe directions so that the model treats false statements as true and vice versa; evaluated on out-of-distribution (OOD) inputs
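The activation-shifting idea can be illustrated with a minimal, dependency-free sketch. It assumes the probe direction is a mass-mean direction, i.e. the difference between the mean activation of true statements and the mean activation of false statements. The function names (`mass_mean_direction`, `shift_activation`, `project`), the scaling factor `alpha`, and the toy 3-dimensional activations are hypothetical and chosen for illustration only; they are not the paper's code or data, and a real experiment would apply the shift to the group (b) hidden states inside the model's forward pass.

```python
from statistics import fmean


def mass_mean_direction(true_acts, false_acts):
    """Difference between the mean activation of true and of false statements."""
    dim = len(true_acts[0])
    mu_true = [fmean(v[i] for v in true_acts) for i in range(dim)]
    mu_false = [fmean(v[i] for v in false_acts) for i in range(dim)]
    return [t - f for t, f in zip(mu_true, mu_false)]


def shift_activation(hidden, direction, alpha=1.0):
    """Add alpha * direction to one hidden state (negative alpha shifts toward 'false')."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def project(hidden, direction):
    """Dot product of a hidden state with the probe direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy, made-up activations (hypothetical values, not taken from the paper).
true_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
false_acts = [[-1.0, 0.0, 2.0], [-3.0, 0.0, 2.0]]

theta = mass_mean_direction(true_acts, false_acts)

# Shift the hidden state of a false statement toward the "true" side.
shifted = shift_activation(false_acts[0], theta, alpha=1.0)
```

In this toy setup the shifted hidden state has a larger projection onto `theta` than the original one, which is the property the intervention relies on; whether the model's TRUE/FALSE output actually flips has to be measured on the model itself.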

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

A small group of causally-implicated hidden states encodes LLM truth representations, localized over clause-ending punctuation tokens · claim · 0.812
Localization result from patching experiments; identifies group (b) hidden states as the locus of truth representations
Internal states appear to encode Bayesian beliefs about hidden external states. · claim · 0.809
The inferential interpretation of internal dynamics.
Hidden States · concept · 0.778
Latent variables causing observations in the generative model.
If internal states encode a probability density over external states, then it should be possible to predict external states from internal states. · hypothesis · 0.754
The testable hypothesis driving the active inference analysis in the simulation.
Larger hidden representations create more random structure that DAS can search through, allowing manipulation of counterfactual behavior even in randomly initialized networks · hypothesis · 0.749
Tested in Section 4.4 calibration experiment; confirmed by findings.
We hypothesize that emotion states are more persistent because they correspond to genuinely stateful internal representations, not merely local surface content · hypothesis · 0.749
Proposed explanation for why emotion probes are more persistent than variance-matched random probes
Truth may be linearly separable in the model's representation space, but the structure is richer than a single linear axis · claim · 0.747
Interpretive synthesis of DIM and cone intervention successes
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.745
Motivating hypothesis for Section 5's investigation of prompt template effects.