finding

active

finding:mass-mean-probe-directions-outperform-lr-and-ccs-in-causal-intervention-experiments-nie-in-7-8-experimental-conditions

Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions

Core result: MM directions are more causally implicated in model outputs than LR directions, despite similar classification accuracy

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (1)

claim

Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs
associated_with · supports
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence

Hypotheses (1)

hypothesis

We hypothesize that group (b) hidden states store a representation of the statement's truth
associated_with
Motivating hypothesis driving the remainder of the paper's analysis after patching localization

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Why were interventions with mass-mean probe directions extracted from the likely dataset so effective, despite these probes not being accurate at classifying true/false statements? · question · 0.840
Open question raised in §7.1 about an unexplained anomalous result
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.785
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Why did mass-mean probing with cities+neg_cities perform poorly for the 70B model, despite mass-mean probing with larger_than+smaller_than performing well? · question · 0.779
Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans · finding · 0.777
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
Mass-mean probes generalize about as well as LR and CCS for LLaMA-2-13B and 70B · finding · 0.776
Despite being simpler and optimization-free, MM probes match accuracy of other techniques at scale
Why did mass-mean probing with cities+neg_cities training data perform poorly for the 70B model, despite larger_than+smaller_than performing well? · question · 0.767
Open question about scale-dependent asymmetry in training data effects
Impulsivity→interest steering: probe entropy increases (LMM slope=0.024, p=2.30×10⁻⁴) but report entropy does not (p=0.11) · finding · 0.761
Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage · claim · 0.760
Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions