Aryaman Arora

openalex A5082261951 name_hash 3f8014c2f7557f0d76116d80…

Authored papers (2)

CausalGym: Benchmarking causal interpretability methods on linguistic tasks (2024)
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded into 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-means, LDA, PCA, k-means, and random baselines at causally influencing language model behavior via 1D distributed interchange intervention. Across the Pythia model family (14M–6.9B parameters), results are measured over 400 training and 100 evaluation examples per task; on pythia-1b, DAS achieves an average log odds-ratio of 10.74, compared to 3.66 for probing and 3.17 for difference-in-means. However, when selectivity is computed by subtracting performance on control tasks that require arbitrary token mappings (an adaptation of Hewitt and Liang's (2019) probing control paradigm), the gap between DAS and probing narrows substantially, revealing that DAS's advantage partly reflects its expressivity rather than genuine causal alignment. Applying DAS to training checkpoints of pythia-1b on NPI licensing (npi_any_subj-relc) and filler-gap dependencies (filler_gap_subj) shows that the causal mechanism for both phenomena emerges in discrete stages rather than gradually: information traverses multiple intermediate token positions before reaching the output, and both mechanisms appear fully only after step 2000–3000 of training. The paper argues that interpretability evaluation therefore requires causal interventional paradigms rather than behavioral or representational proxies alone, and that psycholinguistic LM research should move beyond surprisal comparisons toward mechanistic analysis.
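The quantities named in this summary can be illustrated with a small sketch. It is not the paper's code. It assumes that a 1D distributed interchange intervention swaps the base activation's component along one unit direction for the source's component, that the log odds-ratio compares the model's preference for the base label before the intervention with its preference for the source label after it, and that selectivity is the task score minus the control-task score. All function names and the toy numbers are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(u, v))


def interchange_1d(h_base, h_source, direction):
    """Assumed form of a 1D distributed interchange intervention:
    replace the base activation's component along `direction`
    with the source activation's component along that direction."""
    norm = math.sqrt(dot(direction, direction))
    a = [x / norm for x in direction]            # unit-length direction
    shift = dot(h_source, a) - dot(h_base, a)    # difference of the two projections
    return [hb + shift * ai for hb, ai in zip(h_base, a)]


def log_odds_ratio(p_clean, p_intervened):
    """Each argument is (prob. of base label, prob. of source label) on the base input.
    Larger values mean the intervention moved the model further toward the source label."""
    clean_base, clean_source = p_clean
    int_base, int_source = p_intervened
    return math.log((clean_base / clean_source) * (int_source / int_base))


def selectivity(task_score, control_score):
    """Score on the real task minus score on the arbitrary-mapping control task."""
    return task_score - control_score


# Toy numbers, purely illustrative (not taken from the paper).
patched = interchange_1d([1.0, 2.0], [5.0, -3.0], [2.0, 0.0])
score = log_odds_ratio((0.9, 0.1), (0.2, 0.8))
```

In this toy run only the first coordinate of the base activation changes, because the chosen direction lies along that axis; a learned DAS direction would generally mix many coordinates.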
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions (2024) ⓒ 1
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself, rather than model surgery code, as the primitive abstraction, expressed in a serializable dict-based configuration that can be shared via HuggingFace. Prior libraries (BauKit, TransformerLens, nnsight, graphpatch, Transformer Debugger) either lack extensibility to recurrent and convolutional architectures or require sophisticated custom code for multi-source, cross-forward-pass interventions. pyvene resolves both limitations with Getter/Setter hooks that track state variables, which enables intervention at arbitrary time steps in GRUs and other recurrent models. The library ships with trainable intervention types including RotatedSpaceIntervention (Distributed Alignment Search), LowRankRotatedSpaceIntervention, and BoundlessRotatedSpaceIntervention, and reproduces Meng et al.'s factual-association localization result in GPT2-XL in approximately 20 lines of code. A second case study on Pythia-6.9B demonstrates that a 1D DAS intervention finds sparse, causally localized gender representations across layers, whereas a linear probe achieves near-100% classification accuracy almost everywhere. This implies that high probe accuracy is insufficient evidence of causal relevance, and that trainable interventions provide a strictly more diagnostic test of whether a representation is mechanistically load-bearing for a behavior.
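The idea of describing an intervention as serializable data, then applying it with a getter and a setter, can be sketched without the library. The example below is a toy in plain Python and does not use pyvene's actual API or configuration schema: the config keys ("site", "unit"), the toy model, and every function name are hypothetical and chosen only for illustration.

```python
import json


def toy_model(x, hooks=None):
    """Toy two-step 'network': double each input, then sum.
    `hooks` maps a site name to a function that may rewrite that activation."""
    hooks = hooks or {}
    hidden = [2 * v for v in x]
    if "hidden" in hooks:
        hidden = hooks["hidden"](hidden)   # setter hook rewrites the activation
    activations = {"hidden": hidden}       # recorded so a getter can read it
    return sum(hidden), activations


def run_with_intervention(config, base, source):
    """Apply an interchange intervention described by a plain dict."""
    # Getter: run the source input and collect its activations.
    _, source_acts = toy_model(source)
    site, unit = config["site"], config["unit"]

    def setter(hidden):
        # Setter: overwrite one unit of the base activation with the source value.
        patched = list(hidden)
        patched[unit] = source_acts[site][unit]
        return patched

    output, _ = toy_model(base, hooks={site: setter})
    return output


# The intervention is plain data, so it survives a JSON round trip.
config = json.loads(json.dumps({"site": "hidden", "unit": 0}))
clean_output, _ = toy_model([1, 2])
patched_output = run_with_intervention(config, base=[1, 2], source=[10, 20])
```

Because the intervention lives in the dict rather than in model-specific surgery code, the same description could in principle be stored, shared, and replayed against another run, which is the design point the summary attributes to the library.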

Affiliations (1)

Co-authors (8)

Their work is cited by (5)

Recent mentions (2)