paper:arxiv-2202-05262 Locating and editing factual associations in GPT
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.
- Mapping the Challenges of HCI: An Application and Evaluation of ChatGPT for Mining Insights at Scale · Joonas Hämäläinen, Jonas Oppenlaender · 2026 · ≈ 68%
- Causal Interventions on Causal Paths: Mapping GPT-2's Reasoning From Syntax to Semantics · Joshua Lum, Ziyi Liu, Dani Yogatama, Isabelle Lee · 2024 · ≈ 66%
- A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment · Yaniv Gurwicz, Sungduk Yu, Estelle Aflalo, Vasudev Lal, Raanan Y. Rohekar · 2025 · ≈ 66%
- Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small · Maheep Chaudhary and Atticus Geiger · 2024 · ≈ 66%
- "You tell me": A Dataset of GPT-4-Based Behaviour Change Support Conversations · Selina Meyer and David Elsweiler · 2026 · ≈ 66%
- Model Alignment Search · in corpus · 2025 · ≈ 65%
- Mechanistic interpretability of large language models with applications to the financial services industry · Khashayar Filom, Arjun Ravi Kannan, Ashkan Golgoon · 2024 · ≈ 65%
- Personality-Enhanced Social Recommendations in SAMI: Exploring the Role of Personality Detection in Matchmaking · Samuel Taubman, Travis Taylor, Ashok K. Goel, Brittany Harbison · 2026 · ≈ 65%
- Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs · Daniel J. Lee and Stefan Heimersheim · 2024 · ≈ 65%
- Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis · Amartya Hatua · 2025 · ≈ 65%
- Sentiment Analysis for Education with R: packages, methods and practical applications · Alessia Forciniti, Germana Scepi, Maria Spano, Michelangelo Misuraca · 2026 · ≈ 65%
- Generalist Models in Medical Image Segmentation: A Survey and Performance Comparison with Task-Specific Approaches · Matteo Leccardi (1), Matteo Cavicchioli (1), Alice Maccarini (2), Marco Marcon (1), Luca Mainardi (1), Pietro Cerveri (1 and 2), Andrea Moglia (1); (1) Politecnico di Milano, (2) Università di Pavia · 2025 · ≈ 65%
- What's Next in Affective Modeling? Large Language Models · Tobias Thejll-Madsen, Stacy Marsella, Nutchanon Yongsatianchot · 2023 · ≈ 64%
- Automated stance detection in complex topics and small languages: the challenging case of immigration in polarizing news media · Andres Karjus, Indrek Ibrus, Maximilian Schich, Mark Mets · 2026 · ≈ 64%
- Help! Need Advice on Identifying Advice · Benjamin T Chen, Rebecca Warholic, Katrin Erk, Junyi Jessy Li, Venkata Subrahmanyan Govindarajan · 2026 · ≈ 64%
- Natural Language Processing in the Legal Domain · Daniel Martin Katz, Michael J. Bommarito, Lauritz Gerlach, Abhik Jana, Jerrold Soh, Dirk Hartung · 2026 · ≈ 64%
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations · in corpus · 2023 · ≈ 64%
- Finger Exercises in Formal Concept Analysis · in corpus · 2006 · ≈ 61%
- Alignment faking in large language models · in corpus · 2024 · ≈ 60%
- Simulators — LessWrong · in corpus · ≈ 60%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? · in corpus · 2025 · ≈ 60%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders · in corpus · 2026 · ≈ 59%
Cited by (10)
- Addressing divergent representations from causal interventions on neural networks
Causal intervention methods central to mechanistic interpretability—including activation patching, mean-difference vector patching, Sparse Autoencoders, and Distributed Alignment Search (DAS)—systemat
- Model Alignment Search
Model Alignment Search (MAS) establishes bidirectional causal similarity between neural networks by learning a per-model orthogonal rotation matrix that isolates behaviorally relevant subspaces and us
- pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself—rather than model surgery code—as the primitive abstractio
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and cau
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Distributed alignment search (DAS) resolves two blocking limitations of prior causal abstraction work—brute-force alignment search and the localist assumption that high-level variables map to disjoint
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
Strategic deception in chain-of-thought (CoT) reasoning models is measurable, inducible, and controllable via representation engineering—a finding with direct implications for AI alignment. Applied to
- Endogenous Resistance to Activation Steering in Language Models
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as