Where and how is information stored in model-internal representations?

Core question motivating interchange intervention and interpretability research supported by pyvene

Source paper

extracted_from

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

(2024) · Zhengxuan Wu · Atticus Geiger · Aryaman Arora · Jing Huang +4

Neighborhood — ranked by edge-count

Papers (1)

paper

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
mentions

Findings (1)

finding

pyvene reproduces Meng et al. 2022 Figure 1 (factual association localization in GPT2-XL) in about 20 lines of code
answered_by
Case Study I demonstrating pyvene can replicate a major interpretability result compactly
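
To make the question concrete, here is a minimal sketch of an interchange intervention on a toy model. It is written in plain Python and does not use pyvene's API. The model, the hidden-unit layout, and the names `toy_model` and `interchange_intervention` are hypothetical and exist only for illustration. The idea: run the model on a source input, copy one hidden value into a run on a base input, and check whether the output changes the way it should if that unit stores the hypothesized variable.

```python
from typing import Dict, List, Optional


def toy_model(x: List[float], patch: Optional[Dict[int, float]] = None) -> dict:
    """Hypothetical two-stage model; `patch` overwrites hidden units by index."""
    # By construction, hidden[0] stores x[0] + x[1] and hidden[1] stores x[2].
    hidden = [x[0] + x[1], x[2]]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value  # the intervention: replace the activation
    return {"hidden": hidden, "output": hidden[0] * hidden[1]}


def interchange_intervention(base: List[float], source: List[float], unit: int) -> float:
    """Run on `base`, but with hidden[unit] taken from a run on `source`."""
    source_hidden = toy_model(source)["hidden"]
    return toy_model(base, patch={unit: source_hidden[unit]})["output"]


base = [1.0, 2.0, 3.0]     # sum = 3, unpatched output = 3 * 3
source = [4.0, 5.0, 10.0]  # sum = 9
patched = interchange_intervention(base, source, unit=0)
# If hidden[0] really stores the sum, the patched output should equal
# (source sum) * (base x[2]).
```

If the patched output matches the counterfactual prediction across many base/source pairs, that is evidence the chosen unit carries the hypothesized information. In a real network the unit would be an activation at some layer and token position rather than a list entry.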

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Internal model representations · concept · 0.844
The latent activations or embeddings inside a neural network.
Feedback loops from internal model representations · concept · 0.796
Mechanism of using internal activations or representations to create corrective feedback during generation.
LLM Internal Representations · concept · 0.779
High-dimensional vectors produced at each transformer layer for each input token; the primary substrate analyzed in this study.
Stateful Internal Representation · concept · 0.772
A representation that maintains stable activation across many tokens rather than being locally triggered by specific content.
Model's uncertainty information is encoded in the reflection direction · claim · 0.769
Interpretive claim from probing experiment showing reflection direction features outperform baseline for uncertainty prediction.
Structure in representations · concept · 0.765
The central question of whether representational geometry implies corresponding computational structure.
The geometry of internal representations and the geometry of model behavior share a precise correspondence: representation geometry is a window into the inner world of neural networks. · claim · 0.757
The paper's deepest interpretive claim, asserting that representation structure and behavioral structure are not coincidentally aligned but deeply connected.
We hypothesize that representation geometry drives model behavior: the geometric structure of internal representations causally shapes what models do externally. · hypothesis · 0.749
The causal hypothesis motivating the use of causality (intervention) as the lens connecting representation and behavior geometry.