finding

active

finding:mm-probe-trained-on-likely-dataset-achieves-nie-of-0-70-false-true-on-llama-2-13b-surprisingly-strong-but-weaker-than-truth-probes

MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes

The likely-trained MM probe is a surprisingly effective causal baseline, plausibly because truth and text probability are strongly correlated on sp_en_trans

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Claims (3)

claim

LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by the poor performance of probes trained on the likely dataset when evaluated on datasets where truth and text probability are anti-correlated
associated_with · supports
Establishes that the observed linear structure is not merely a representation of text probability
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs
supports
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
Logistic regression fails to identify the true feature direction when a confounding feature is non-orthogonal to the truth direction, converging instead to the maximum margin separator
supports
Motivates the introduction of mass-mean probing as an alternative to LR
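
As a rough illustration of the mass-mean (difference-in-means) probing these claims refer to, the sketch below computes a probe direction as the mean activation of true statements minus the mean activation of false statements. It is a toy sketch, not the paper's code: the function names (mass_mean_direction, project) and the two-dimensional activations are hypothetical, and it omits any covariance correction or bias term.

```python
from statistics import fmean


def mass_mean_direction(true_acts, false_acts):
    """Return mean(true_acts) - mean(false_acts), computed per dimension."""
    dim = len(true_acts[0])
    mu_true = [fmean(row[i] for row in true_acts) for i in range(dim)]
    mu_false = [fmean(row[i] for row in false_acts) for i in range(dim)]
    return [t - f for t, f in zip(mu_true, mu_false)]


def project(direction, x):
    """Dot product of an activation vector with the probe direction."""
    return sum(d * xi for d, xi in zip(direction, x))


# Hypothetical toy activations: dimension 0 separates the classes,
# dimension 1 varies within each class but has the same mean in both.
true_acts = [[2.0, 1.0], [4.0, 3.0]]
false_acts = [[0.0, 1.0], [2.0, 3.0]]

theta = mass_mean_direction(true_acts, false_acts)
score = project(theta, [3.0, 5.0])
```

In this toy data the second coordinate has the same mean in both classes, so it contributes nothing to the direction; only the class-separating coordinate does.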

Questions (1)

question

Why were interventions with mass-mean probe directions extracted from the likely dataset so effective, despite these probes not being accurate at classifying true/false statements?
gates
Open question raised in §7.1 about an unexplained anomalous result

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.843
Larger models linearly represent more general concepts including truth
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans · finding · 0.840
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique · finding · 0.835
Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth · finding · 0.814
Shows that truth representations are not reducible to text probability representations
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.794
Shows behavioral pattern of self-correction is trainable in smaller models
The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family. · finding · 0.791
Establishes generalizability of the core difficulty-boundary finding across model families.
Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions · finding · 0.785
Core result showing MM is superior to LR for causal implication despite similar classification accuracy
Interest probe: peak Cohen's d=1.67 (layer 14), p=9.45×10⁻⁶ in LLaMA-3.2-3B · finding · 0.784
Probe validation result confirming interest direction captures meaningful structure