Zhengxuan Wu (thinker:zhengxuan-wu)
Authored papers (2)
- pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself, rather than model surgery code, as the primitive abstraction. Interventions are expressed in a serializable dict-based configuration that can be shared via HuggingFace. Prior libraries (BauKit, TransformerLens, nnsight, graphpatch, Transformer Debugger) either do not extend to recurrent and convolutional architectures or require sophisticated custom code for multi-source, cross-forward-pass interventions. pyvene addresses both limitations with Getter/Setter hooks that track state variables, which enables intervention at arbitrary time steps in GRUs and other recurrent models. The library ships with trainable intervention types, including RotatedSpaceIntervention (Distributed Alignment Search), LowRankRotatedSpaceIntervention, and BoundlessRotatedSpaceIntervention, and reproduces Meng et al.'s factual-association localization result in GPT2-XL in approximately 20 lines of code. A second case study on Pythia-6.9B shows that a 1D DAS intervention finds sparse, causally localized gender representations across layers, whereas a linear probe achieves near-100% classification accuracy almost everywhere. This implies that high probe accuracy is insufficient evidence of causal relevance, and that trainable interventions provide a more diagnostic test of whether a representation is mechanistically load-bearing for a behavior.
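The idea of an intervention described by a serializable dict and carried out by getter/setter hooks can be sketched in plain Python. This is a toy illustration only and does not use pyvene's API: the layer functions, `run`, `intervene`, and the config keys (`layer`, `units`, `type`) are all hypothetical names chosen for this example.

```python
import json

# Hypothetical toy "model": a stack of layers acting on a list of floats.
def layer0(h):
    return [x + 1.0 for x in h]

def layer1(h):
    return [2.0 * x for x in h]

def layer2(h):
    return [sum(h)]

LAYERS = [layer0, layer1, layer2]

def run(x, getter=None, setter=None):
    """Run the toy model; a getter records activations, a setter may overwrite them."""
    h = list(x)
    for i, layer in enumerate(LAYERS):
        h = layer(h)
        if getter is not None:
            getter(i, h)
        if setter is not None:
            h = setter(i, h)
    return h

def intervene(config, base, source):
    """Interchange intervention: copy selected units from a source run into a base run."""
    cache = {}

    def getter(i, h):
        # Record the source activation at the configured layer.
        if i == config["layer"]:
            cache["h"] = list(h)

    run(source, getter=getter)

    def setter(i, h):
        # Overwrite only the configured units at the configured layer.
        if i != config["layer"]:
            return h
        patched = list(h)
        for u in config["units"]:
            patched[u] = cache["h"][u]
        return patched

    return run(base, setter=setter)

# The config is plain data, so it survives a JSON round trip unchanged.
config = json.loads(json.dumps({"layer": 1, "units": [0], "type": "interchange"}))

base, source = [1.0, 2.0], [10.0, 20.0]
plain_output = run(base)                           # no intervention
patched_output = intervene(config, base, source)   # unit 0 of layer 1 taken from source
```

Because the config is data rather than code, the same `intervene` call can be reused for a different layer or unit set by editing the dict, which is the property the summary attributes to pyvene's configuration format.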
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations (2023) ⓒ 9
Distributed alignment search (DAS) addresses two limitations that blocked prior causal abstraction work: brute-force alignment search, and the localist assumption that high-level variables map to disjoint neuron sets. It uses gradient descent over orthogonal rotation matrices to find alignments in non-standard bases of neural representations. On a hierarchical equality task, a three-layer feed-forward network with hidden size 16 achieves 100% interchange intervention accuracy (IIA) under DAS at layer 1 with an 8-dimensional intervention subspace, whereas the best brute-force localist search reaches only 0.60 IIA and the closest localist alignment only 0.73 IIA. On the Monotonicity NLI benchmark, BERT-base fine-tuned on MoNLI achieves 100% IIA at layer 9 when 256 non-standard basis dimensions of the [CLS] token encode lexical entailment and 256 others encode negation, while no localist alignment exceeds 0.51 IIA on the same task. A subsequent subspace decomposition reveals a structural asymmetry: the hierarchical equality representations of w=x and y=z cannot be decomposed into representations of individual input identities (subspace DAS IIA ≈ 0.50–0.51), whereas the apparent lexical-entailment representation in BERT decomposes almost perfectly (IIA ≈ 0.97–0.98) into two word-identity representations. These results suggest that previous negative or weak causal abstraction findings may have been artifacts of the localist assumption, and that neural networks can genuinely implement tree-structured symbolic algorithms. They also suggest that apparent relational representations may sometimes be data structures over entity identities rather than true relational encodings.
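The core operation in this summary, an interchange intervention performed in a rotated basis and scored by IIA, can be sketched on a two-neuron toy. This is not the paper's setup: the rotation angle is fixed by hand rather than learned by gradient descent, and the hidden states, the `readout` threshold, and the example pairs are assumptions invented for illustration.

```python
import math

def rotation(theta):
    """2x2 orthogonal rotation matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def rotated_interchange(R, base_h, source_h, k):
    """Swap the first k coordinates in the rotated basis, then rotate back."""
    rb, rs = matvec(R, base_h), matvec(R, source_h)
    mixed = rs[:k] + rb[k:]
    # R is orthogonal, so its inverse is its transpose.
    return matvec(transpose(R), mixed)

def readout(h):
    """Toy downstream computation: depends only on the sum of the two neurons."""
    return 1 if h[0] + h[1] > 2.5 else 0

def iia(R, pairs, k=1):
    """Fraction of (base, source) pairs where the patched output matches the
    counterfactual label, i.e. the output implied by the source's variable."""
    hits = 0
    for base_h, source_h in pairs:
        patched_h = rotated_interchange(R, base_h, source_h, k)
        hits += readout(patched_h) == readout(source_h)
    return hits / len(pairs)

# Hypothetical hidden-state pairs (base, source).
PAIRS = [
    ([3.0, 1.0], [0.0, 2.0]),
    ([0.0, 1.0], [1.0, 3.0]),
    ([2.0, 2.0], [2.0, 0.0]),
    ([1.0, 0.0], [3.0, 0.0]),
]

# Rotating by -pi/4 makes the first rotated coordinate proportional to h0 + h1,
# a direction spread across both neurons rather than stored in either one.
R_das = rotation(-math.pi / 4)
R_localist = rotation(0.0)  # identity: swaps neuron 0 directly

patched = rotated_interchange(R_das, [3.0, 1.0], [0.0, 2.0], k=1)
das_iia = iia(R_das, PAIRS)
localist_iia = iia(R_localist, PAIRS)
```

In this toy the variable that matters lives along a diagonal direction, so swapping a single neuron reaches an IIA of 0.5 while swapping one coordinate in the rotated basis reaches 1.0. That mirrors, in miniature, the gap between localist and distributed alignments that the summary reports.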
Affiliations (1)
- Stanford University(institute)
Co-authors (12)
- Atticus Geiger (18 shared)
- Christopher Potts (18 shared)
- Noah D. Goodman (18 shared)
- Aryaman Arora (9 shared)
- Christopher D. Manning (9 shared)
- Jing Huang (9 shared)
- Thomas Icard (9 shared)
- Zheng Wang (9 shared)
- David E. Rumelhart (3 shared)
- James L. McClelland (3 shared)
- Judea Pearl (3 shared)
- Paul Smolensky (3 shared)
Their work is cited by (6)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (6× refs)
- Addressing divergent representations from causal interventions on neural networks (6× refs)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? (6× refs)
- pyvene: A Library for Understanding and Improving PyTorch Models via Interventions (3× refs)
- Model Alignment Search (3× refs)
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts (3× refs)
Other inbound relations (3)
Recent mentions (6)
- papers-typedgrant-2025-addressing-divergent.md
- papers-typedgrant-2025-alignment-search.md
- papers-typedwu-2024-pyvene-library.md
- papers-typedarora-2024-causalgym-benchmarking.md
- papers-typedgeiger-2023-finding-alignments.md
- papers-typedblas-2026-psychological.md