finding
active
finding:minimal-euclidean-distances-between-hidden-states-are-smaller-for-pairs-sharing-same-output-or-equality-variable-values-than-for-pairs-that-do-not-across-1-280-000-mlp-samples
Minimal Euclidean distances between hidden states are smaller for pairs sharing the same output or equality-variable values than for pairs that do not, across 1,280,000 MLP samples
Explains why RevNet lacks capacity to separate states for identity-of-first-argument algorithm
Source paper
extracted_from (2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
- We hypothesize that group (b) hidden states store a representation of the statement's truth · hypothesis · 0.735 · Motivating hypothesis driving the remainder of the paper's analysis after patching localization
- SAE features are not simply mirroring individual neurons.
- Validates MAS as a causal detector of representational differences invisible to correlative methods.
- Interpretive synthesis of DIM and cone intervention successes
- Empirical support for input-injectivity assumption holding in practice
- Important caveat to the CL loss solution, noting it is a step not a complete fix
- Localization result from patching experiments; identifies group (b) hidden states as the locus of truth representations