finding
active
Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth
Shows that truth representations are not reducible to text probability representations
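The logic of this finding can be illustrated with a toy sketch. It is not the paper's experiment: the 2-D "activations", the noise level, and the helper names (`make_activations`, `fit_mass_mean_probe`, `probe_accuracy`) are assumptions made for illustration. One axis stands in for truth and the other for text probability. A mass-mean (difference-of-means) probe is fit on data where labels track probability only, then scored on data where probability and truth are anti-correlated.

```python
import numpy as np

def make_activations(truth, prob, rng, noise=0.3):
    # Hypothetical 2-D activations: axis 0 encodes truth, axis 1 encodes text probability.
    signal = np.stack([truth, prob], axis=1).astype(float)
    return signal + noise * rng.standard_normal(signal.shape)

def fit_mass_mean_probe(x, y):
    # Probe direction is the difference between the class means;
    # the decision threshold sits at the midpoint between them.
    mean_pos = x[y == 1].mean(axis=0)
    mean_neg = x[y == 0].mean(axis=0)
    return mean_pos - mean_neg, 0.5 * (mean_pos + mean_neg)

def probe_accuracy(direction, midpoint, x, y):
    pred = ((x - midpoint) @ direction > 0).astype(int)
    return float((pred == y).mean())

rng = np.random.default_rng(0)
n = 2000

# Toy stand-in for a "likely"-style training set: labels follow probability,
# and the truth axis carries no signal.
train_prob = rng.choice([-1, 1], size=n)
train_x = make_activations(np.zeros(n), train_prob, rng)
train_y = (train_prob > 0).astype(int)

# Toy evaluation set where true statements are the low-probability ones.
eval_truth = rng.choice([-1, 1], size=n)
eval_x = make_activations(eval_truth, -eval_truth, rng)
eval_y = (eval_truth > 0).astype(int)

direction, midpoint = fit_mass_mean_probe(train_x, train_y)
acc = probe_accuracy(direction, midpoint, eval_x, eval_y)
print(f"accuracy on anti-correlated toy data: {acc:.3f}")
```

In this sketch the probe aligns with the probability axis, so it systematically predicts the wrong label when probability and truth point in opposite directions. That is the sense in which below-chance accuracy indicates the probe tracks probability and not truth.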
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Claims (1)
claim
- LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets
  associated_with · supports
  Establishes that the observed linear structure is not merely a representation of text probability
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
- Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
- Open question raised in §7.1 about an unexplained anomalous result
- Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
- Open question about scale-dependent asymmetry in training data effects