pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

By Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah D. Goodman, +2 more (Goodfire, Harvard University, +2 more)

DOI 10.48550/arxiv.2403.07809 arXiv 2403.07809 OpenAlex W4392781392

Causal Tracing · Backpack Language Models · Amnesic Probing · Counterfactual State · transformer architecture · Boundless DAS · Gender Representation in LLMs · Causal Scrubbing · Getter and Setter Hooks · Causal Structural Probe · Interchange Intervention Accuracy (IIA) · Distributed Alignment Search · Intervenable Configuration · Inference-Time Intervention (ITI) · +12 more

TL;DR

pyvene is an open-source Python library that unifies intervention-based research on PyTorch models by treating the intervention itself, rather than model-surgery code, as the primitive abstraction. Interventions are expressed in a serializable dict-based configuration that can be shared via HuggingFace. Prior libraries (BauKit, TransformerLens, nnsight, graphpatch, Transformer Debugger) either do not extend to recurrent and convolutional architectures or require substantial custom code for multi-source, cross-forward-pass interventions. pyvene addresses the first limitation with Getter/Setter hooks that track a state variable, so an intervention can fire at a chosen time step in GRUs and other recurrent models, and the second with its declarative configuration format. The library ships with trainable intervention types including RotatedSpaceIntervention (Distributed Alignment Search), LowRankRotatedSpaceIntervention, and BoundlessRotatedSpaceIntervention, and reproduces Meng et al.'s factual-association localization result in GPT2-XL in approximately 20 lines of code. A second case study on Pythia-6.9B shows that a 1D DAS intervention finds sparse, causally localized gender representations across layers, whereas a linear probe achieves near-100% classification accuracy almost everywhere. High probe accuracy is therefore insufficient evidence of causal relevance; trainable interventions offer a more diagnostic test of whether a representation is mechanistically load-bearing for a behavior.

What to take away

1. pyvene introduces the IntervenableConfig and IntervenableModel abstractions, which express interventions as serializable dict-based configuration objects rather than runtime code hooks, enabling sharing through HuggingFace or any model hub.
2. Meng et al.'s factual-association localization result in GPT2-XL (a 48-layer model) is fully reproduced in approximately 20 lines of pyvene code, demonstrating the library's conciseness for complex causal tracing experiments.
3. On Pythia-6.9B, a trainable 1D Distributed Alignment Search (DAS) intervention finds sparse gender-representing subspaces across layers, while a linear probe achieves near-100% accuracy at almost every layer and token position—showing the two methods disagree sharply on which representations are causally relevant.
4. pyvene's Getter/Setter hook system records a state variable per hook so that interventions can be triggered at a specific time step in recurrent models (e.g., GRU with h_dim=32), a capability absent from all prior hook-based libraries including BauKit, nnsight, graphpatch, and Transformer Debugger.
5. The library supports parallel multi-source interchange interventions in which activations from multiple source forward passes are simultaneously patched into a base computation graph; in a GPT-2 experiment mixing tokens from 'The language of Spain' and 'The capital of Italy', 'Italian' appears in the top-5 output logits.
6. Serial multi-source interventions are also supported, where an activation patched into an intermediate source forward pass propagates into a subsequent base forward pass, and 'Italian' again appears in the top-5 logits after chaining two such interventions through GPT-2's residual stream.
7. To replicate the inference-time intervention of Li et al. (2023a) on TinyStories-33M, pyvene adds a static word embedding for 'happy' or 'sad' to the MLP output at every decoding step across all layers with a coefficient of 0.3, demonstrating per-decoding-step intervention support for generative LMs.
8. An open question raised by the gender-localization case study is whether the divergence between DAS intervention accuracy (IIA) and linear probe accuracy is a general phenomenon across tasks and model scales, or specific to simple grammatical features in large models like Pythia-6.9B.
9. To replicate the DAS training setup, a researcher should construct counterfactual pairs from a template ('John/Sarah walked because he/she') with 47 male and 10 female names, then train a LowRankRotatedSpaceIntervention with low_rank_dimension=1 at each Transformer block output, using cross-entropy loss against the gold counterfactual pronoun token.
10. pyvene is published on PyPI (pip install pyvene) and provides more than 20 tutorials spanning simple feed-forward networks, Transformers, recurrent models, and multi-modal models, with all code runnable on Google Colab.
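
The first takeaway, interventions as serializable dict-based configuration rather than runtime hook code, can be illustrated with a small sketch. This is plain Python on a toy three-layer model, not pyvene itself: the schema keys (`representations`, `layer`, `component`, `intervention`, `scale`) and the `run` helper are hypothetical names chosen for illustration and should not be read as pyvene's actual configuration fields.

```python
import json

# Hypothetical schema: these keys are illustrative, NOT pyvene's real config fields.
config = {
    "representations": [
        {"layer": 1, "component": "block_output",
         "intervention": "addition", "scale": 0.3},
    ]
}

# A toy "model": an ordered list of layers, each a plain function on a list of floats.
def double(xs):
    return [2.0 * x for x in xs]

def shift(xs):
    return [x + 1.0 for x in xs]

LAYERS = [double, shift, double]

def run(xs, config=None, source=None):
    """Run the toy model, applying any configured intervention after its layer."""
    specs = (config or {}).get("representations", [])
    by_layer = {spec["layer"]: spec for spec in specs}
    for index, layer in enumerate(LAYERS):
        xs = layer(xs)
        spec = by_layer.get(index)
        if spec is None:
            continue
        if spec["intervention"] == "addition":
            # Add a scaled source vector to the activation (steering-style).
            xs = [x + spec["scale"] * s for x, s in zip(xs, source)]
        elif spec["intervention"] == "interchange":
            # Replace the activation with the source activation outright.
            xs = list(source)
        else:
            raise ValueError(f"unknown intervention: {spec['intervention']}")
    return xs

# Because the intervention is data, it survives a JSON round trip unchanged.
restored = json.loads(json.dumps(config))

baseline = run([1.0, 2.0])
steered = run([1.0, 2.0], restored, source=[10.0, 0.0])
```

The point of the design is that `restored` can be written to disk or uploaded alongside a model and then re-applied by someone else, without shipping the hook code that implements it.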

Peer brief — for seminar discussion

Wu et al. introduce pyvene, an open-source Python library distributed via PyPI that reframes neural network intervention as a first-class, serializable abstraction rather than bespoke hook code. Existing tools (BauKit, TransformerLens, nnsight, graphpatch, and OpenAI's Transformer Debugger) implement interventions as runtime code executed on a single forward pass. pyvene's IntervenableConfig and IntervenableModel classes instead allow multi-source, cross-forward-pass, parallel, and serial interventions to be specified declaratively, stored, and shared through HuggingFace. The library also extends beyond Transformers to convolutional and recurrent models; for recurrent architectures such as GRUs, it maintains per-hook state variables that gate execution to a target time step, something vanilla hook approaches cannot do.
Two case studies anchor the claims. First, Meng et al.'s causal tracing experiment localizing factual associations in GPT2-XL (48 layers) is reproduced in roughly 20 lines of pyvene code, validating the API's expressiveness against a well-known result. Second, gender localization in Pythia-6.9B (a 32-layer, ~7B-parameter model from the Pythia suite) uses a 1D Distributed Alignment Search (DAS) intervention, implemented as pyvene's LowRankRotatedSpaceIntervention with low_rank_dimension=1. DAS finds sparse causally relevant subspaces, whereas a linear probe achieves near-100% accuracy across virtually every layer and token position. The load-bearing implication is that probe accuracy is not a reliable proxy for causal relevance: a representation can be linearly decodable without being mechanistically load-bearing, and trainable causal interventions are needed to adjudicate. An alternative the authors could have used for the gender localization study is Amnesic Probing (Elazar et al., 2020), which tests causal necessity through concept erasure rather than subspace rotation.
A critical reader would push back on the scope of the empirical validation. Both case studies are replications or simple demonstrations rather than novel findings, so the library's claimed advantage, handling complex intervention schemes more easily than its predecessors, is never stress-tested on a task that those predecessors genuinely fail on at scale. The paper implicitly predicts that the probe-versus-intervention divergence observed in Pythia-6.9B will generalize, which motivates pyvene as infrastructure for systematically testing that hypothesis across architectures and tasks, but no such evidence is provided here.
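
The gap between probe accuracy and causal relevance can be made concrete with a toy example. The sketch below is a hand-built two-unit "model" in plain Python, not the Pythia-6.9B experiment: both hidden units encode the label, but only unit 0 is read by the output. The names `encode`, `decode`, `probe_accuracy`, and `iia` are hypothetical helpers introduced for illustration.

```python
from itertools import product

LABELS = [-1, 1]  # stand-ins for the two pronoun classes

def encode(x):
    # Both hidden units carry the label, so both are linearly decodable.
    return [float(x), float(x)]

def decode(h):
    # Only unit 0 is read by the output; unit 1 is causally inert.
    return "he" if h[0] > 0 else "she"

def probe_accuracy(unit):
    """Accuracy of a sign-threshold probe that reads a single hidden unit."""
    hits = sum((encode(x)[unit] > 0) == (x > 0) for x in LABELS)
    return hits / len(LABELS)

def iia(unit):
    """Interchange intervention accuracy for one unit.

    Patch the unit from a source run into a base run with the opposite
    label, then check whether the output flips to the source's answer.
    """
    pairs = [(b, s) for b, s in product(LABELS, LABELS) if b != s]
    hits = 0
    for base, source in pairs:
        hidden = encode(base)
        hidden[unit] = encode(source)[unit]
        hits += decode(hidden) == decode(encode(source))
    return hits / len(pairs)

results = {unit: (probe_accuracy(unit), iia(unit)) for unit in (0, 1)}
```

Here the probe scores perfectly on both units, while the interchange intervention succeeds only on unit 0, which mirrors the qualitative pattern the case study reports.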

Methods (8)

Amnesic Probing
Behavioral explanation technique by Elazar et al. 2020 that removes a property from a representation and measures the resulting change in model behavior
Boundless DAS
A variant of DAS implemented in pyvene via BoundlessRotatedSpaceIntervention, introduced by Wu et al. 2023
Causal Scrubbing
Method by Chan et al. 2022 for rigorously testing interpretability hypotheses via interventions
Causal Structural Probe
Probe method combining causal interventions and structural analysis, supported by pyvene's activation collection
Distributed Alignment Search
Method by Geiger et al. 2023 that finds alignments between high-level causal variables and distributed neural representations via gradient descent; implemented in pyvene via RotatedSpaceIntervention.
Inference-Time Intervention (ITI)
Method by Li et al. 2023a that adds static vectors to model activations at inference time to steer behavior
Interchange Intervention Training (IIT)
Training technique that induces specific causal structures in neural networks by co-training with interchange interventions
Path Patching
Method by Goldowsky-Dill et al. 2023 for localizing model behavior via targeted activation interventions
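
For the DAS-style entries above, it may help to see what a one-dimensional subspace interchange does to an activation. The sketch below is one common way to write a rank-1 interchange, assuming a fixed unit direction v: the base activation keeps everything orthogonal to v and takes the source's coordinate along v. In DAS the direction is learned by gradient descent; here it is supplied by hand, and `rank1_interchange` is a hypothetical helper, not a pyvene function.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def rank1_interchange(base, source, direction):
    """Swap the coordinate along one direction from source into base.

    With unit vector v:  patched = base + ((source - base) . v) * v
    Components of base orthogonal to v are left untouched.
    """
    v = normalize(direction)
    delta = dot(source, v) - dot(base, v)
    return [b + delta * vi for b, vi in zip(base, v)]

base = [1.0, 2.0, 3.0]
source = [4.0, 0.0, -1.0]
direction = [1.0, 1.0, 0.0]  # assumed direction; DAS would learn this

patched = rank1_interchange(base, source, direction)
```

After the swap, the projection of `patched` onto the chosen direction matches the source, while its projection onto any orthogonal direction still matches the base.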

Frameworks (2)

Backpack Language Models
LM architecture with sense vectors showing multiplication effects, illustrating custom intervention in pyvene
transformer architecture
Neural network architecture based on attention, commonly used in large language models

Findings (5)

'Italian' is among the top five returned logits after parallel multi-source interchange intervention mixing language and Italy activations in GPT-2
Demonstrates semantic mixing via parallel interventions producing expected composite outputs
DAS trainable intervention finds sparser gender representations across layers compared to linear probe in Pythia-6.9B
Case Study II result showing DAS identifies fewer causally relevant positions than a probe
pyvene reproduces Meng et al. 2022 Figure 1 (factual association localization in GPT2-XL) in about 20 lines of code
Case Study I demonstrating pyvene can replicate a major interpretability result compactly
Linear probe achieves 100% classification accuracy for almost all components in Pythia-6.9B gender task
Demonstrates that linear probes can overestimate causal relevance; probes succeed on non-causally-relevant representations
Pythia-6.9B achieves 100% accuracy on gendered pronoun prediction task
Baseline result confirming the model has fully learned the gender prediction task before probing
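
The difference between parallel and serial multi-source interventions, which underlies the 'Italian' finding above, can be sketched on a toy model. This is plain Python with a made-up two-layer network whose second layer mixes earlier positions into later ones; it is not GPT-2 and does not use pyvene. The `forward` function and its `patch_h1` / `patch_h2` arguments are hypothetical.

```python
from itertools import accumulate

def forward(tokens, patch_h1=None, patch_h2=None):
    """Toy two-layer model with optional per-position activation patches."""
    h1 = [t + 1 for t in tokens]          # layer 1: per-position transform
    for pos, value in (patch_h1 or {}).items():
        h1[pos] = value
    h2 = list(accumulate(h1))             # layer 2: mixes earlier positions
    for pos, value in (patch_h2 or {}).items():
        h2[pos] = value
    return {"h1": h1, "h2": h2, "out": sum(h2)}

base, source_a, source_b = [0, 0, 0], [10, 0, 0], [0, 20, 0]
run_a, run_b = forward(source_a), forward(source_b)

# Parallel: activations from two source runs are patched into one base run.
parallel = forward(base, patch_h1={0: run_a["h1"][0], 1: run_b["h1"][1]})

# Serial: patch source A into source B first, then carry a downstream
# activation from that intervened run into the base run.
intermediate = forward(source_b, patch_h1={0: run_a["h1"][0]})
serial = forward(base, patch_h2={1: intermediate["h2"][1]})
```

In the serial case the value carried into the base run already reflects the first intervention, which is what distinguishes it from two independent patches.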

Claims (6)

Existing intervention libraries are often project-based, lack extensibility, are hard to maintain and share, and are limited to single or non-nested interventions on Transformers
Motivation claim contrasting pyvene with prior tools like BauKit, TransformerLens, nnsight, graphpatch
pyvene provides a unified and extensible framework for performing interventions on neural models and sharing the intervened upon models with others
Core design claim of the pyvene paper summarizing its contribution over existing tools
The intervention is the basic primitive of pyvene, specified with a dict-based format rather than expressed as code executed at runtime
Design philosophy claim distinguishing pyvene's approach from prior libraries
Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage
Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
A vanilla hook-based approach, as in all previous libraries, fails to intervene on any recurrent or state-space model
Technical claim justifying pyvene's state-variable hook tracking for recurrent model support
A probe may achieve high performance even on representations that are not causally relevant for the task
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
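
The claim about recurrent models rests on a simple observation: the same module is called once per time step, so an ungated hook fires on every call. A minimal sketch of the state-variable idea follows, using a toy cell in plain Python rather than a PyTorch GRU; `ToyRecurrentCell`, `make_timestep_hook`, and `unroll` are hypothetical names, and this is not pyvene's implementation.

```python
class ToyRecurrentCell:
    """Minimal recurrent cell: h_t = 0.5 * h_{t-1} + x_t, with output hooks."""

    def __init__(self):
        self.hooks = []

    def __call__(self, x, h):
        out = 0.5 * h + x
        for hook in self.hooks:
            replaced = hook(out)
            if replaced is not None:  # a hook may override the output
                out = replaced
        return out

def make_timestep_hook(target_step, new_value):
    """Build a hook that counts its own calls and fires only at target_step."""
    state = {"calls": 0}  # the per-hook state variable

    def hook(output):
        step = state["calls"]
        state["calls"] += 1
        return new_value if step == target_step else None

    return hook

def unroll(cell, inputs, h0=0.0):
    """Run the cell over a sequence and return the hidden state at each step."""
    h, trace = h0, []
    for x in inputs:
        h = cell(x, h)
        trace.append(h)
    return trace

inputs = [1.0, 1.0, 1.0, 1.0]
clean = unroll(ToyRecurrentCell(), inputs)

cell = ToyRecurrentCell()
cell.hooks.append(make_timestep_hook(target_step=2, new_value=0.0))
intervened = unroll(cell, inputs)
```

The hidden states before step 2 are unchanged, step 2 is overwritten, and later steps evolve from the overwritten state.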

Hypotheses (1)

We hypothesize that intervention efficiency can be scaled with multi-node and multi-GPU training as language models grow larger
Future work hypothesis about scaling pyvene's computational efficiency for very large models

Questions (2)

Are high-accuracy probe representations also causally relevant for the task?
Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II
Where and how is information stored in model-internal representations?
Core question motivating interchange intervention and interpretability research supported by pyvene

Original abstract

Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability. To facilitate such research, we introduce pyvene, an open-source Python library that supports customizable interventions on a range of different PyTorch modules. pyvene supports complex intervention schemes with an intuitive configuration format, and its interventions can be static or include trainable parameters. We show how pyvene provides a unified and extensible framework for performing interventions on neural models and sharing the intervened upon models with others. We illustrate the power of the library via interpretability analyses using causal abstraction and knowledge localization. We publish our library through Python Package Index (PyPI) and provide code, documentation, and tutorials at https://github.com/stanfordnlp/pyvene.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
cited
in corpus
2023
≈ 75%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 80%
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Xu Wang, Yaoning Wang, Yu Wan, Yubo Ma, Baosong Yang, Haoran Wei, Jialong Tang, Huan Lin, Ruize Gao, Tianhao Li, Qian Cao, Xuancheng Ren, Xiaodong Deng, An Yang, Fei Huang, Dayiheng Liu, Jingren Zhou Boyi Deng
2026
≈ 77%
ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders
Xiangyu Liu and Haodi Lei and Yi Liu and Yang Liu and Wei Hu
2026
≈ 77%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 77%
A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders
Rajiv Misra, Sanjay Kumar Singh, Anisha Roy Dip Roy
2026
≈ 77%
ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation
Meng Cao, Ping Yang, Rongtao Xu, Yunxiao Yan, Runze Xu, Liang Ma, Roy Gan, Andy Zhai, Qingxuan Chen, Zunnan Xu, Hao Wang, Jincheng Yu, Lucy Liang, Qian Wang, Ivan Laptev, Ian D Reid, Xiaodan Liang Yu Sun
2026
≈ 77%
A Timeline and Analysis for Representation Plasticity in Large Language Models
Akshat Kannan
2024
≈ 77%
RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning
Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gong, Yuran Wang, Yuxuan Kuang, Ruihai Wu, Baoxiong Jia, Carlo Sferrazza, Hao Dong, Siyuan Huang, Yue Wang, Jitendra Malik, Pieter Abbeel Haoran Geng
2025
≈ 77%
Patches of Nonlinearity: Instruction Vectors in Large Language Models
Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych Irina Bigoulaeva
2026
≈ 77%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
in corpus
2026
≈ 77%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 77%
SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks
Vegard Flovik
2026
≈ 77%
Interpreting Language Model Parameters
in corpus
2026
≈ 77%
CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
Xianxin Lai, Weiyu Chen, Xiao-Ping Zhang, and Jiayu Chen Ziyi Ding
2026
≈ 76%
Model Alignment Search
in corpus
2025
≈ 76%
The Platonic Representation Hypothesis
in corpus
2024
≈ 76%
Why Representation Engineering Works: A Theoretical and Empirical Study in Vision-Language Models
Xuntao Lyu, Meng Liu, Hongyi Wang, Ang Li Bowei Tian
2025
≈ 76%
EVA: Towards a universal model of the immune system
Vincent Bouget, Apolline Bruley, Yannis Cattan, Charlotte Claye, Matthew Corney, Julien Duquesne, Karim El Kanbi, Aziz Fouché, Pierre Marschall, Francesco Strozzi Scienta Team: Ethan Bandasack
2026
≈ 76%
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
Julian Forsyth, Thomas Fel, Matthew Kowal, Konstantinos G. Derpanis Harrish Thasarathan
2026
≈ 76%
PALMS: A Computational Implementation for Pavlovian Associative Learning Models' Simulation
Alessandro Abati, Julián Jiménez Nimmo, Sean Lim and Esther Mondragón Martin Fixman
2026
≈ 76%
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
Yuzhang Luo, Liangming Pan Jianhui Chen
2026
≈ 76%
A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Jingyi Cui, Yisen Wang, Mengnan Du, Dianbo Liu Yiming Tang
2026
≈ 76%
Paper Summary: Interpreting Language Model Parameters
in corpus
≈ 76%
AI: a Bridge toward Diverse Intelligence and Humanity’s Future
in corpus
2024
≈ 76%
Alignment faking in large language models
in corpus
2024
≈ 75%
Neural natural language inference models partially embed theories of lexical entailment and negation
cited
2020
≈ 71%
Inference-time intervention: eliciting truthful answers from a language model
cited
2023
≈ 66%
Dissecting recall of factual associations in auto-regressive language models
cited
2023
≈ 64%
Causal abstractions of neural networks
cited
2021
≈ 61%

+16 more

Cited by (3)

Addressing divergent representations from causal interventions on neural networks
Causal intervention methods central to mechanistic interpretability—including activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)—systemat
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean