finding

active

finding:minimal-euclidean-distances-between-hidden-states-are-smaller-for-pairs-sharing-same-output-or-equality-variable-values-than-for-pairs-that-do-not-across-1-280-000-mlp-samples

Minimal Euclidean distances between hidden states are smaller for pairs sharing the same output or equality-variable values than for pairs that do not, across 1,280,000 MLP samples

Explains why the RevNet lacks the capacity to separate states for the identity-of-first-argument algorithm
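
The comparison described in this finding can be illustrated with a minimal sketch. This is not the paper's procedure: the function name `min_pair_distances`, the toy clusters, and the brute-force pairwise loop are all assumptions made for illustration. It computes the smallest Euclidean distance among pairs of hidden states that share a label value and among pairs that do not.

```python
import itertools
import math
import random


def min_pair_distances(states, labels):
    """Return (min distance over same-label pairs, min distance over different-label pairs).

    Hypothetical helper: `states` is a list of equal-length float vectors,
    `labels` holds one output (or equality-variable) value per state.
    """
    min_same = math.inf
    min_diff = math.inf
    pairs = itertools.combinations(zip(states, labels), 2)
    for (h_a, y_a), (h_b, y_b) in pairs:
        d = math.dist(h_a, h_b)  # Euclidean distance between the two vectors
        if y_a == y_b:
            min_same = min(min_same, d)
        else:
            min_diff = min(min_diff, d)
    return min_same, min_diff


# Synthetic data (an assumption, not real hidden states):
# 50 four-dimensional points around 0.0 with label 0, then 50 around 1.0 with label 1.
rng = random.Random(0)
states = [[rng.gauss(c, 0.1) for _ in range(4)] for c in (0.0, 1.0) for _ in range(50)]
labels = [0] * 50 + [1] * 50

min_same, min_diff = min_pair_distances(states, labels)
print(f"min same-label distance: {min_same:.4f}")
print(f"min different-label distance: {min_diff:.4f}")
```

The loop is quadratic in the number of states, so at the scale of 1,280,000 samples a batched or nearest-neighbor-index approach would likely be needed instead.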

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.748
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
We hypothesize that group (b) hidden states store a representation of the statement's truth · hypothesis · 0.735
Motivating hypothesis driving the remainder of the paper's analysis after patching localization
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance · finding · 0.735
SAE features are not simply mirroring individual neurons.
MAS IIA is low for GRU hidden states vs Transformer hidden states on Multi-Object task, consistent with anti-Markovian transformer solution · finding · 0.732
Validates MAS as a causal detector of representational differences invisible to correlative methods.
Truth may be linearly separable in the model's representation space, but the structure is richer than a single linear axis · claim · 0.731
Interpretive synthesis of DIM and cone intervention successes
No collisions found in 1,280,000 randomly sampled inputs through trained MLP in hierarchical equality task across 10 random seeds · finding · 0.729
Empirical support for input-injectivity assumption holding in practice
Minimizing divergence magnitude does not guarantee elimination of hidden pathways; it only reduces the risk surface · claim · 0.728
Important caveat to the CL loss solution, noting it is a step, not a complete fix
A small group of causally-implicated hidden states encodes LLM truth representations, localized over clause-ending punctuation tokens · claim · 0.727
Localization result from patching experiments; identifies group (b) hidden states as the locus of truth representations