claim

active

claim:simple-difference-in-mean-probes-generalize-as-well-as-other-probing-techniques-while-identifying-directions-which-are-more-causally-implicated-in-model-outputs

Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs

Key methodological claim: mass-mean (MM) probes, i.e. difference-in-mean probes, are both competitive in classification accuracy and superior in causal influence
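To make the claim concrete, here is a minimal sketch of a difference-in-mean (mass-mean) probe in Python with NumPy. It is an illustration under stated assumptions, not the paper's code: the activations are synthetic stand-ins, the function names `mass_mean_direction` and `mass_mean_predict` are hypothetical, and thresholding at the midpoint between the two class means is a choice made here for simplicity.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Probe direction = mean of true-class activations minus mean of false-class activations."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)

def mass_mean_predict(acts, direction, midpoint):
    """Predict True when an activation projects past the midpoint along the direction.

    The midpoint threshold is an assumption of this sketch.
    """
    acts = np.asarray(acts, dtype=float)
    return (acts - midpoint) @ direction > 0

# Synthetic stand-in for model activations (hypothetical data, not from the paper):
# the two classes are separated along the first coordinate only.
rng = np.random.default_rng(0)
truth_axis = np.array([1.0, 0.0, 0.0, 0.0])
labels = rng.integers(0, 2, size=200).astype(bool)
acts = rng.normal(size=(200, 4)) + np.where(labels[:, None], 3.0, -3.0) * truth_axis

# No optimization is involved: the probe is two averages and a subtraction.
direction = mass_mean_direction(acts, labels)
midpoint = 0.5 * (acts[labels].mean(axis=0) + acts[~labels].mean(axis=0))
preds = mass_mean_predict(acts, direction, midpoint)
accuracy = (preds == labels).mean()
```

The same `direction` vector is what a causal intervention would add to or subtract from an activation, so the classification and intervention experiments both use a single optimization-free quantity.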

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Papers (1)

paper

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
introduces

Findings (3)

finding

Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions
associated_with · supports
Core result showing MM is superior to LR for causal implication despite similar classification accuracy
MM probe trained on the likely dataset achieves an NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes
supports
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Mass-mean probes generalize about as well as LR and CCS for LLaMA-2-13B and 70B
supports
Despite being simpler and optimization-free, MM probes match accuracy of other techniques at scale

Questions (1)

question

Why were interventions with mass-mean probe directions extracted from the likely dataset so effective, despite these probes not being accurate at classifying true/false statements?
gates
Open question raised in §7.1 about an unexplained anomalous result

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.792
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. · quote · 0.791
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
Probes trained under different explicit instruction variants are highly aligned with each other despite different wording. · claim · 0.785
Shows the key divide is passive vs. active framing, not the specific wording of instructions.
Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlations · claim · 0.782
Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
A probe achieving high classification accuracy provides no guarantee that the model actually distinguishes those classes in downstream computations · claim · 0.781
Motivation for causal evaluation over purely behavioural probing accuracy
Results may not fully generalize to all models and scenarios because the model organism relies on hints and nudges and Llama Nemotron cannot consistently distinguish evaluation/deployment based on subtle cues · claim · 0.781
Key limitation acknowledged by authors.
Probe Generalization · concept · 0.781
The ability of probes trained on one dataset to transfer accurately to topically and structurally different datasets
What if the concept being manipulated does not lie on a straight line in the model's representations? · question · 0.780
The motivating question that opens the paper and leads to the development of manifold steering.