thinker
active
thinker:zheng-wang

Zheng Wang

Authored
1
Introduces
0
Studies
0
Affiliations
1
Cited by
3

Authored papers (1)

  • pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself—rather than model surgery code—as the primitive abstraction, expressed in a serializable dict-based configuration that can be shared via HuggingFace. Prior libraries (BauKit, TransformerLens, nnsight, graphpatch, Transformer Debugger) either lack extensibility to recurrent and convolutional architectures or require sophisticated custom code for multi-source, cross-forward-pass interventions; pyvene resolves both limitations with Getter/Setter hooks that track state variables enabling intervention at arbitrary time steps in GRU and other recurrent models. The library ships with trainable intervention types including RotatedSpaceIntervention (Distributed Alignment Search), LowRankRotatedSpaceIntervention, and BoundlessRotatedSpaceIntervention, and reproduces Meng et al.'s factual-association localization result in GPT2-XL in approximately 20 lines of code. A second case study on Pythia-6.9B demonstrates that a 1D DAS intervention finds sparse, causally localized gender representations across layers, whereas a linear probe achieves near-100% classification accuracy almost everywhere—implying that high probe accuracy is insufficient evidence of causal relevance, and that trainable interventions provide a strictly more diagnostic test of whether a representation is mechanistically load-bearing for a behavior.

More papers — OpenAlex / S2

Affiliations (1)

Co-authors (7)

Recent mentions (1)