Dissecting recall of factual associations in auto-regressive language models

By Mor Geva · Jasmijn Bastings · Katja Filippova · Amir Globerson

DOI: 10.48550/arxiv.2304.14767 · arXiv: 2304.14767

Related work: refs + corpus + external arXiv

The Cited / in-corpus / arXiv badges show which signals surfaced each row. Rows surfaced by multiple sources are weighted higher.

Tracing Facts or just Copies? A critical investigation of the Competitions of Mechanisms in Large Language Models
Dante Campregher, Yanxu Chen, Sander Hoffman, Maria Heuss
2025
≈ 75%
Across the Levels of Analysis: Explaining Predictive Processing in Humans Requires More Than Machine-Estimated Probabilities
Sathvik Nair and Colin Phillips
2026
≈ 75%
DataDignity: Training Data Attribution for Large Language Models
Xiaomin Li, Andrzej Banburski-Fahey, Jaron Lanier
2026
≈ 75%
Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?
Zhengyang Shan, Aaron Mueller
2025
≈ 75%
Causal Evidence that Language Models use Confidence to Drive Behavior
Dharshan Kumaran, Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean
2026
≈ 75%
Perceptions of Linguistic Uncertainty by Language Models and Humans
Catarina G Belem, Markelle Kelly, Mark Steyvers, Sameer Singh, Padhraic Smyth
2024
≈ 74%
Evaluating Large Language Models with Psychometrics
Yuan Li, Yue Huang, Hongyi Wang, Ying Cheng, Xiangliang Zhang, James Zou, Lichao Sun
2025
≈ 74%
Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models
Ayush Rajesh Jhaveri, Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi
2026
≈ 74%
Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models
Katharina Simbeck, Mariam Mahran
2025
≈ 74%
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du
2025
≈ 74%
Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
Bhavik Chandna, Zubair Bashir, Procheta Sen
2025
≈ 74%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 74%
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
2025
≈ 74%
Nationality encoding in language model hidden states: Probing culturally differentiated representations in persona-conditioned academic text
Paul Jackson, Ruizhe Li, Elspeth Edelstein (University of Aberdeen, United Kingdom)
2026
≈ 74%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 74%
Language Models "Grok" to Copy
Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
2025
≈ 74%
Reanalyzing L2 Preposition Learning with Bayesian Mixed Effects and a Pretrained Language Model
Jakob Prange and Man Ho Ivy Wong
2026
≈ 74%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 73%
Paper Summary: Interpreting Language Model Parameters
in corpus
≈ 73%
Model Alignment Search
in corpus
2025
≈ 72%
Interpreting Language Model Parameters
in corpus
2026
≈ 72%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
in corpus
2026
≈ 71%
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
in corpus
2026
≈ 71%
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
in corpus
2026
≈ 70%
Verbalized Eval Awareness Inflates Measured Safety
in corpus
2026
≈ 70%
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
in corpus
2026
≈ 70%
Active Inference with a Self-Prior in the Mirror-Mark Task
in corpus
2026
≈ 70%
Persistence and Introspection of Emotion Features
in corpus
≈ 70%

Cited by (4)

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself—rather than model surgery code—as the primitive abstractio
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and cau
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as