finding
active
Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions
Core result: mass-mean (MM) probe directions are more causally implicated in model outputs than logistic regression (LR) directions, despite similar classification accuracy
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Claims (1)
claim
- Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · associated_with, supports · Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
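As a rough illustration of what a difference-in-mean (mass-mean) direction is, the sketch below computes the difference between the mean activation of true statements and the mean activation of false statements, then projects a new activation onto it. The function names and the tiny two-dimensional activations are hypothetical placeholders, not data or code from the source paper, and the sketch covers only the direction computation, not the paper's intervention procedure.

```python
# Minimal sketch of a mass-mean (difference-in-mean) probe direction.
# All names and the toy activations are illustrative assumptions.

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mass_mean_direction(true_acts, false_acts):
    """Direction = mean(true activations) - mean(false activations)."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    return [t - f for t, f in zip(mu_true, mu_false)]

def project(direction, x):
    """Dot product of an activation with the probe direction."""
    return sum(d * xi for d, xi in zip(direction, x))

# Toy activations: the classes differ only along the first coordinate.
true_acts = [[2.0, 1.0], [4.0, 1.0]]
false_acts = [[0.0, 1.0], [-2.0, 1.0]]

theta = mass_mean_direction(true_acts, false_acts)
score = project(theta, [1.0, 5.0])
```

No optimization is involved: the direction is fully determined by the two class means, which is why this kind of probe is described as simpler than trained alternatives.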
Hypotheses (1)
hypothesis
- We hypothesize that group (b) hidden states store a representation of the statement's truth · associated_with · Motivating hypothesis driving the remainder of the paper's analysis after patching localization
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Open question raised in §7.1 about an unexplained anomalous result
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
- Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale
- Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
- Despite being simpler and optimization-free, MM probes match accuracy of other techniques at scale
- Open question about scale-dependent asymmetry in training data effects
- Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
- Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions