thinker:david-e-rumelhart
David E. Rumelhart
PDP researcher; foundational work on distributed representations cited in DAS motivation.
Authored papers (1)
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations (2023, cited by 9)
Distributed alignment search (DAS) resolves two blocking limitations of prior causal abstraction work: brute-force alignment search, and the localist assumption that high-level variables map to disjoint neuron sets. It does so by using gradient descent over orthogonal rotation matrices to find alignments in non-standard bases of neural representations. On a hierarchical equality task, a three-layer feed-forward network with hidden size 16 achieves an interchange intervention accuracy (IIA) of 1.00 under DAS at layer 1 with an 8-dimensional intervention subspace, whereas the best brute-force localist search reaches only 0.60 IIA and the closest localist alignment only 0.73 IIA. On the Monotonicity NLI benchmark, BERT-base fine-tuned on MoNLI achieves 1.00 IIA at layer 9 when 256 non-standard basis dimensions of the [CLS] token encode lexical entailment and 256 others encode negation, while no localist alignment exceeds 0.51 IIA on the same task. A subsequent subspace decomposition reveals a structural asymmetry: the hierarchical equality representations of w=x and y=z cannot be decomposed into representations of individual input identities (subspace DAS IIA ≈ 0.50–0.51), whereas the apparent lexical-entailment representation in BERT decomposes almost perfectly (IIA ≈ 0.97–0.98) into two word-identity representations. These results imply that previous negative or weak causal abstraction findings may have been artifacts of the localist assumption, and that neural networks can genuinely implement tree-structured symbolic algorithms, although apparent relational representations may sometimes be data structures over entity identities rather than true relational encodings.
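The core operation in the summary above, an interchange intervention carried out in a rotated (non-standard) basis, can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the function names (`rotated_interchange`, `interchange_intervention_accuracy`), the 2-dimensional hidden state, and the fixed 45-degree rotation are all assumptions chosen for illustration. In DAS the rotation would be learned by gradient descent rather than fixed by hand.

```python
import math

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def rotated_interchange(base, source, rotation, k):
    """Hypothetical helper: patch the first k rotated coordinates of
    `base` with those of `source`, then rotate back.

    `rotation` is assumed to be orthogonal, so its transpose is its inverse.
    """
    base_rot = matvec(rotation, base)
    source_rot = matvec(rotation, source)
    mixed = source_rot[:k] + base_rot[k:]  # swap only the aligned subspace
    return matvec(transpose(rotation), mixed)

def interchange_intervention_accuracy(predicted, expected):
    """Fraction of interventions where the network's output matches the
    counterfactual output of the high-level causal model."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)

# Illustrative values only: a fixed 45-degree rotation stands in for a learned one.
theta = math.pi / 4
R = [[math.cos(theta), math.sin(theta)],
     [-math.sin(theta), math.cos(theta)]]

patched = rotated_interchange([1.0, 0.0], [0.0, 2.0], R, k=1)
```

With the identity matrix as `rotation`, the same function reduces to the localist case, where whole neurons are swapped. A non-trivial rotation lets the swapped subspace cut across neurons, which is what allows a distributed alignment to succeed where a neuron-level one fails.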
Co-authors (8)
- Atticus Geiger (3 shared)
- Christopher Potts (3 shared)
- Noah D. Goodman (3 shared)
- Thomas Icard (3 shared)
- Zhengxuan Wu (3 shared)
- James L. McClelland (1 shared)
- Judea Pearl (1 shared)
- Paul Smolensky (1 shared)
Their work is cited by (6)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (1× refs)
- Addressing divergent representations from causal interventions on neural networks (1× refs)
- pyvene: A Library for Understanding and Improving PyTorch Models via Interventions (1× refs)
- Model Alignment Search (1× refs)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (1× refs)
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts (1× refs)
Recent mentions (1)
- papers-typedgeiger-2023-finding-alignments.md