Inference-time intervention: eliciting truthful answers from a language model

By Kenneth Li · Oam Patel · Fernanda Viégas · Hanspeter Pfister · Martin Wattenberg

DOI 10.52202/075280-1797 · arXiv 2306.03341

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Enhancing reasoning accuracy in large language models during inference time
Manish Jain, Vinay Sharma
2026
≈ 76%
Modeling Human Behavior Part I -- Learning and Belief Approaches
Andrew Fuchs, Andrea Passarella, Marco Conti
2022
≈ 76%
Perceptions of Linguistic Uncertainty by Language Models and Humans
Markelle Kelly, Mark Steyvers, Sameer Singh, Padhraic Smyth, Catarina G Belem
2024
≈ 75%
Active inference and artificial reasoning
Lancelot Da Costa, Alexander Tschantz, Conor Heins, Christopher Buckley, Tim Verbelen, Thomas Parr, Karl Friston
2025
≈ 75%
Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models
Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi, Ayush Rajesh Jhaveri
2026
≈ 74%
When Should an AI Act? A Human-Centered Model of Scene, Context, and Behavior for Agentic AI Design
Daehoo Yoon, Sung Gyu Koh, Young Hwan Kim, Yehan Ahn, Sung Park, Soyoung Jung
2026
≈ 74%
Interactive inference: a multi-agent model of cooperative joint actions
Francesco Donnarumma, Giovanni Pezzulo, Domenico Maisto
2024
≈ 74%
An Active Inference Strategy for Prompting Reliable Responses from Large Language Models in Medical Practice
Allison C. Waters, Shannon O'Neill, Phan Luu, Don M. Tucker, Roma Shusterman
2024
≈ 74%
Language and Experience: A Computational Model of Social Learning in Complex Tasks
Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman, Jacob Andreas, Joshua Tenenbaum, Cédric Colas
2026
≈ 74%
Belief Attribution as Mental Explanation: The Role of Accuracy, Informativity, and Causality
Almog Hillel, Ryan Truong, Vikash K. Mansinghka, Joshua B. Tenenbaum, Tan Zhi-Xuan, Lance Ying
2025
≈ 74%
Causal Evidence that Language Models use Confidence to Drive Behavior
Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean, Dharshan Kumaran
2026
≈ 74%
Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models
Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl, Xingyuan Zhang
2023
≈ 74%
Explanations are a Means to an End: Decision Theoretic Explanation Evaluation
Berk Ustun, Jessica Hullman, Ziyang Guo
2026
≈ 73%
Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
Yue Zhang, Jinku Li, Rui Jiao
2025
≈ 73%
Reasoning Models Generate Societies of Thought
Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans, Junsol Kim
2026
≈ 73%
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
in corpus
2026
≈ 72%
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
in corpus
2025
≈ 72%
Active Inference, Curiosity and Insight
in corpus
2017
≈ 71%
Active inference: demystified and compared
in corpus
2021
≈ 71%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 69%
Paper Summary: Interpreting Language Model Parameters
in corpus
≈ 69%
Active inference on discrete state-spaces: a synthesis
in corpus
2020
≈ 69%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 69%
Verbalized Eval Awareness Inflates Measured Safety
in corpus
2026
≈ 68%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 68%
Testing the Limits of Truth Directions in LLMs
in corpus
2026
≈ 68%
Interpreting Language Model Parameters
in corpus
2026
≈ 68%
Alignment faking in large language models
in corpus
2024
≈ 68%
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
in corpus
2025
≈ 67%

Cited by (7)

Testing the Limits of Truth Directions in LLMs
Linear truth directions in LLMs are reliable primarily for simple factual retrieval and break down as soon as truth assessment requires tracking intermediate results—a finding that sharply constrains
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself—rather than model surgery code—as the primitive abstractio
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and cau
Endogenous Resistance to Activation Steering in Language Models
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie