Locating and editing factual associations in GPT

By Kevin Meng · David Bau · Alex Andonian · Yonatan Belinkov

DOI 10.52202/068431-1262 arXiv 2202.05262

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Mapping the Challenges of HCI: An Application and Evaluation of ChatGPT for Mining Insights at Scale
Joonas Hämäläinen, Jonas Oppenlaender
2026
≈ 68%
Causal Interventions on Causal Paths: Mapping GPT-2's Reasoning From Syntax to Semantics
Joshua Lum, Ziyi Liu, Dani Yogatama, Isabelle Lee
2024
≈ 66%
A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment
Yaniv Gurwicz, Sungduk Yu, Estelle Aflalo, Vasudev Lal, Raanan Y. Rohekar
2025
≈ 66%
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary and Atticus Geiger
2024
≈ 66%
"You tell me": A Dataset of GPT-4-Based Behaviour Change Support Conversations
Selina Meyer and David Elsweiler
2026
≈ 66%
Model Alignment Search
in corpus
2025
≈ 65%
Mechanistic interpretability of large language models with applications to the financial services industry
Khashayar Filom, Arjun Ravi Kannan, Ashkan Golgoon
2024
≈ 65%
Personality-Enhanced Social Recommendations in SAMI: Exploring the Role of Personality Detection in Matchmaking
Samuel Taubman, Travis Taylor, Ashok K. Goel, Brittany Harbison
2026
≈ 65%
Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel J. Lee and Stefan Heimersheim
2024
≈ 65%
Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis
Amartya Hatua
2025
≈ 65%
Sentiment Analysis for Education with R: packages, methods and practical applications
Alessia Forciniti, Germana Scepi, Maria Spano, Michelangelo Misuraca
2026
≈ 65%
Generalist Models in Medical Image Segmentation: A Survey and Performance Comparison with Task-Specific Approaches
Matteo Leccardi (1), Matteo Cavicchioli (1), Alice Maccarini (2), Marco Marcon (1), Luca Mainardi (1), Pietro Cerveri (1 and 2), Andrea Moglia (1); (1) Politecnico di Milano, (2) Università di Pavia
2025
≈ 65%
What's Next in Affective Modeling? Large Language Models
Tobias Thejll-Madsen, Stacy Marsella, Nutchanon Yongsatianchot
2023
≈ 64%
Automated stance detection in complex topics and small languages: the challenging case of immigration in polarizing news media
Andres Karjus, Indrek Ibrus, Maximilian Schich, Mark Mets
2026
≈ 64%
Help! Need Advice on Identifying Advice
Benjamin T Chen, Rebecca Warholic, Katrin Erk, Junyi Jessy Li, Venkata Subrahmanyan Govindarajan
2026
≈ 64%
Natural Language Processing in the Legal Domain
Daniel Martin Katz, Michael J. Bommarito, Lauritz Gerlach, Abhik Jana, Jerrold Soh, Dirk Hartung
2026
≈ 64%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
in corpus
2023
≈ 64%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 62%
Finger Exercises in Formal Concept Analysis
in corpus
2006
≈ 61%
Alignment faking in large language models
in corpus
2024
≈ 60%
Simulators — LessWrong
in corpus
≈ 60%
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
in corpus
2026
≈ 60%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 60%
The Guanyin Protocol: A Framework for Immediately Establishing an Understanding of Both Causality and Compassion in LLM Systems Using Semantic Anchoring
in corpus
2025
≈ 60%
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
in corpus
2024
≈ 60%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 59%

Cited by (10)

Addressing divergent representations from causal interventions on neural networks
Causal intervention methods central to mechanistic interpretability—including activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)—systemat
Model Alignment Search
Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces and us
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself—rather than model surgery code—as the primitive abstractio
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and cau
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Distributed alignment search (DAS) resolves two blocking limitations of prior causal abstraction work—brute-force alignment search and the localist assumption that high-level variables map to disjoint
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
Strategic deception in chain-of-thought (CoT) reasoning models is measurable, inducible, and controllable via representation engineering—a finding with direct implications for AI alignment. Applied to
Endogenous Resistance to Activation Steering in Language Models
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as