finding
active
For neg_cities, truth value and LLaMA-2-70B log probability correlate at r=-0.63; for neg_sp_en_trans, at r=-0.89
Demonstrates strong anti-correlation between text probability and truth in negated datasets
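A minimal sketch of how a correlation of this kind could be computed, assuming each statement has a binary truth label and a scalar log probability from the model. The function name `pearson_r` and the toy values are illustrative assumptions. They are not the paper's code or data, and they do not reproduce the reported r values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values (not from the paper): 1 = true statement, 0 = false.
truth = [1, 1, 1, 0, 0, 0]
# Hypothetical log probabilities; here the true statements score lower,
# mimicking the anti-correlation described for the negated datasets.
log_prob = [-9.0, -8.5, -9.5, -6.0, -7.0, -6.5]

r = pearson_r(truth, log_prob)
print(f"r = {r:.2f}")  # negative for this toy data
```

With a binary truth label, the Pearson coefficient coincides with the point-biserial correlation, so a negative r means true statements tend to receive lower log probability than false ones.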
Source paper
extracted_from(2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Claims (2)
claim
- Establishes that the observed linear structure is not merely a representation of text probability
- Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Hypothesized intermediate feature explaining antipodal alignment between cities and neg_cities in early-middle layers
- Layer-by-layer evolution of truth direction alignment, supporting hierarchical abstraction hypothesis
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
- Establishes generalizability of the core difficulty-boundary finding across model families.
- Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
- Core E3 finding validating S as a predictor of anchoring effectiveness
- Case of misalignment showing that the truth direction is not always shared between a dataset and its negation in smaller models
- Experiment 2 aggregate amplification result showing amplifying deception features strongly suppresses consciousness claims