question

active

question:why-were-interventions-with-mass-mean-probe-directions-extracted-from-the-likely-dataset-so-effective-despite-these-probes-not-being-accurate-at-classifying-true-false-statements

Why were interventions with mass-mean probe directions extracted from the likely dataset so effective, despite these probes not being accurate at classifying true/false statements?

Open question raised in §7.1 about an unexplained anomalous result

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Papers (1)

paper

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
associated_with

Findings (1)

finding

MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes
gates
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans

Claims (1)

claim

Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs
gates
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions · finding · 0.840
Core result showing MM is superior to LR for causal implication despite similar classification accuracy
Why did mass-mean probing with cities+neg_cities perform poorly for the 70B model, despite mass-mean probing with larger_than+smaller_than performing well? · question · 0.808
Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale
Why did mass-mean probing with cities+neg_cities training data perform poorly for the 70B model, despite larger_than+smaller_than performing well? · question · 0.800
Open question about scale-dependent asymmetry in training data effects
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. · quote · 0.793
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
Probe-based data attribution effectively reduces harmful behaviors via data interventions · claim · 0.791
Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth · finding · 0.790
Shows that truth representations are not reducible to text probability representations
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.790
Motivating hypothesis for Section 5's investigation of prompt template effects.
Accuracy does not vary linearly with latent reflection directions; instead it follows a more non-linear mapping that requires deeper theoretical treatment. · claim · 0.779
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.