Aryaman Arora (thinker:aryaman-arora)
Authored papers (2)
CausalGym, a benchmark of 29 tasks derived and expanded from SyntaxGym's 33 test suites, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-means, LDA, PCA, k-means, and random baselines in causally influencing language model behavior via 1D distributed interchange intervention. Experiments span the Pythia model family (14M–6.9B parameters); on pythia-1b, DAS achieves an average log odds-ratio of 10.74, compared to 3.66 for probing and 3.17 for difference-in-means, with 400 training and 100 evaluation examples per task. However, when selectivity is computed by subtracting performance on control tasks that require arbitrary token mappings (an adaptation of Hewitt and Liang's (2019) probing control paradigm), the gap between DAS and probing narrows substantially, revealing that DAS's advantage partly reflects its expressivity rather than genuine causal alignment. Applying DAS to training checkpoints of pythia-1b on NPI licensing (npi_any_subj-relc) and filler-gap dependencies (filler_gap_subj) shows that the causal mechanism for both phenomena emerges in discrete stages rather than gradually: information traverses multiple intermediate token positions before reaching the output, and both mechanisms appear fully only after step 2000–3000 of training. The paper argues that interpretability evaluation therefore requires causal interventional paradigms rather than behavioral or representational proxies alone, and that psycholinguistic LM research should move beyond surprisal comparisons toward mechanistic analysis.
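The three quantities in this summary (the 1D distributed interchange intervention, the log odds-ratio, and selectivity) can be illustrated with a minimal pure-Python sketch. It assumes the intervention swaps only the component of the base activation along a single unit direction, and that the log odds-ratio combines the base-versus-source label odds before the intervention with the source-versus-base odds after it. The function and parameter names are hypothetical, and the exact metric definition should be checked against the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(w):
    """Scale w to unit length so it defines a 1D subspace."""
    norm = math.sqrt(dot(w, w))
    return [x / norm for x in w]


def interchange_1d(base, source, direction):
    """Assumed 1D interchange: b + (w.s - w.b) w, with w a unit vector.

    Only the component of `base` along `direction` is replaced by the
    corresponding component of `source`; everything orthogonal is kept.
    """
    w = normalize(direction)
    delta = dot(w, source) - dot(w, base)
    return [b + delta * wi for b, wi in zip(base, w)]


def log_odds_ratio(p_base, p_source, p_base_iv, p_source_iv):
    """Assumed form of the metric (hypothetical argument names).

    p_base / p_source: probabilities of the base and source labels
    before the intervention; *_iv: the same after the intervention.
    Larger values mean the intervention moved the model further
    toward the source label.
    """
    return math.log(p_base / p_source) + math.log(p_source_iv / p_base_iv)


def selectivity(odds_task, odds_control):
    """Score on the real task minus score on the control task."""
    return odds_task - odds_control
```

For example, `interchange_1d([1.0, 2.0], [3.0, -1.0], [2.0, 0.0])` replaces only the first coordinate of the base vector with the source's value, and an intervention that leaves the label probabilities unchanged yields a log odds-ratio of zero.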
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself, rather than model surgery code, as the primitive abstraction, expressed in a serializable dict-based configuration that can be shared via HuggingFace. Prior libraries (BauKit, TransformerLens, nnsight, graphpatch, Transformer Debugger) either lack extensibility to recurrent and convolutional architectures or require sophisticated custom code for multi-source, cross-forward-pass interventions. pyvene addresses both limitations with Getter/Setter hooks that track state variables, enabling intervention at arbitrary time steps in GRUs and other recurrent models. The library ships with trainable intervention types including RotatedSpaceIntervention (Distributed Alignment Search), LowRankRotatedSpaceIntervention, and BoundlessRotatedSpaceIntervention, and reproduces Meng et al.'s factual-association localization result in GPT2-XL in approximately 20 lines of code. A second case study on Pythia-6.9B demonstrates that a 1D DAS intervention finds sparse, causally localized gender representations across layers, whereas a linear probe achieves near-100% classification accuracy almost everywhere. This implies that high probe accuracy is insufficient evidence of causal relevance, and that trainable interventions provide a strictly more diagnostic test of whether a representation is mechanistically load-bearing for a behavior.
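The idea of a serializable intervention config driving Getter/Setter hooks on a recurrent model can be sketched in a few lines of plain Python. This is a toy illustration only: it does not use pyvene, the config keys are a hypothetical schema rather than pyvene's actual format, and the scalar recurrent model stands in for a real GRU.

```python
import json
import math

# Hypothetical config schema for illustration; NOT pyvene's real format.
config = {"component": "hidden_state", "time_step": 1,
          "intervention": "interchange"}


def run_rnn(inputs, hook=None, w_h=0.5, w_x=1.0):
    """Toy scalar recurrent model: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    states = []
    for t, x in enumerate(inputs):
        h = math.tanh(w_h * h + w_x * x)
        if hook is not None:
            h = hook(t, h)  # a hook may read or overwrite the state
        states.append(h)
    return states


def make_getter(cache):
    """Getter hook: record the state at every time step, change nothing."""
    def getter(t, h):
        cache[t] = h
        return h
    return getter


def make_setter(cache, cfg):
    """Setter hook: overwrite the state at the configured time step."""
    def setter(t, h):
        return cache[t] if t == cfg["time_step"] else h
    return setter


def interchange(base_inputs, source_inputs, cfg):
    """Run the source to collect states, then patch them into the base run."""
    cache = {}
    run_rnn(source_inputs, hook=make_getter(cache))
    return run_rnn(base_inputs, hook=make_setter(cache, cfg))


# The config survives a JSON round trip, so it can be stored and shared.
restored = json.loads(json.dumps(config))
base, source = [1.0, 0.0, 0.0], [-1.0, 0.5, 0.0]
patched = interchange(base, source, restored)
```

Because the config is plain data, the same intervention can be replayed on another run or shared without shipping the hook code; the patched state at the chosen time step then propagates through the remaining recurrent steps of the base run.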
Affiliations (1)
- Stanford University (institute)
Co-authors (8)
- Christopher Potts (18 shared)
- Atticus Geiger (9 shared)
- Christopher D. Manning (9 shared)
- Dan Jurafsky (9 shared)
- Jing Huang (9 shared)
- Noah D. Goodman (9 shared)
- Zheng Wang (9 shared)
- Zhengxuan Wu (9 shared)
Their work is cited by (5)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (6× refs)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (3× refs)
- Addressing divergent representations from causal interventions on neural networks (3× refs)
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior (3× refs)
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts (3× refs)
Recent mentions (2)
- papers-typedwu-2024-pyvene-library.md
- papers-typedarora-2024-causalgym-benchmarking.md