finding

active

No collisions found in 1,280,000 randomly sampled inputs through trained MLP in hierarchical equality task across 10 random seeds

Empirical support that the input-injectivity assumption holds in practice
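A collision check of this kind can be sketched as follows. This is a minimal illustration, not the paper's procedure: the MLP here is randomly initialised rather than trained, the layer sizes and sample count are arbitrary, and the names `mlp_hidden`, `count_collisions`, and `demo` are hypothetical. It also assumes that a collision means two distinct inputs with bit-identical hidden states; the source does not state the exact criterion.

```python
import numpy as np


def mlp_hidden(x, weights, biases):
    """Return the final hidden-layer activations of a small ReLU MLP."""
    h = x
    for W, b in zip(weights, biases):
        h = np.maximum(h @ W + b, 0.0)
    return h


def count_collisions(inputs, hidden):
    """Count distinct inputs whose hidden state duplicates that of another distinct input.

    Assumption: a collision is an exact (bit-identical) match of hidden states.
    """
    # Drop duplicate inputs first, so identical inputs are not counted as collisions.
    _, first_idx = np.unique(inputs, axis=0, return_index=True)
    distinct_hidden = np.unique(hidden[first_idx], axis=0)
    return len(first_idx) - distinct_hidden.shape[0]


def demo(n_samples=10_000, d_in=16, d_hidden=32, n_layers=2, seed=0):
    """Run the check on a randomly initialised MLP (illustrative sizes only)."""
    rng = np.random.default_rng(seed)
    dims = [d_in] + [d_hidden] * n_layers
    weights = [
        rng.normal(size=(a, b)) / np.sqrt(a) for a, b in zip(dims[:-1], dims[1:])
    ]
    biases = [np.zeros(b) for b in dims[1:]]
    x = rng.normal(size=(n_samples, d_in))
    return count_collisions(x, mlp_hidden(x, weights, biases))


if __name__ == "__main__":
    print("collisions:", demo())
```

A count of zero on a finite sample is consistent with injectivity on the sampled region but does not prove it, which is why the finding is framed as empirical support rather than a guarantee.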

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Concepts (1)

concept

Input-Injectivity
supports
Assumption that DNN layers preserve input information by being injective; key condition for Theorem 1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Over 80% IIA achieved using complex non-linear alignment maps on randomly initialised MLPs in hierarchical equality task · finding · 0.760
Demonstrates that high IIA can be obtained even when model cannot solve the task
Approximately 0.2% of MLP neurons at layer 18 (~28 neurons) are sufficient to account for the generic addition computation across all cyclic tasks · claim · 0.737
Claim about the sparsity and sufficiency of the identified neuron set
A sparse set of 28 MLP neurons at layer 18 (~0.2% of MLP) are reused across all cyclic tasks · finding · 0.734
Quantitative finding identifying the specific neurons responsible for generic addition
Minimal Euclidean distances between hidden states are smaller for pairs sharing same output or equality-variable values than for pairs that do not, across 1,280,000 MLP samples · finding · 0.729
Explains why RevNet lacks capacity to separate states for identity-of-first-argument algorithm
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance. · finding · 0.726
SAE features are not simply mirroring individual neurons.
24 Gaussian noise vectors (matched mean/variance/ℓ2 norm) do not decrease type hint rate nearly as much as deployment steering vector · finding · 0.723
Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.
There are fewer representations competent for N tasks than M<N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.717
Selective pressure toward convergence via task generality
Cross-model pairwise cosine similarity of zero-shot control responses = 0.603 (n=12,720 pairs, t=35.1, p=4.3×10⁻²⁶² vs. experimental) · finding · 0.715
Experiment 3 comparison: zero-shot control shows lower semantic convergence than experimental condition