question
active
question:why-were-interventions-with-mass-mean-probe-directions-extracted-from-the-likely-dataset-so-effective-despite-these-probes-not-being-accurate-at-classifying-true-false-statements
Why were interventions with mass-mean probe directions extracted from the likely dataset so effective, despite these probes not being accurate at classifying true/false statements?
Open question raised in §7.1 about an unexplained anomalous result
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Claims (1)
claim
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Core result showing MM is superior to LR for causal implication despite similar classification accuracy
- Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale
- Open question about scale-dependent asymmetry in training data effects
- Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
- Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
- Shows that truth representations are not reducible to text probability representations
- Motivating hypothesis for Section 5's investigation of prompt template effects.
- Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.