paper:li-inference-time-intervention-eliciting-tr-2023
Inference-time intervention: eliciting truthful answers from a language model
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Enhancing reasoning accuracy in large language models during inference time | Manish Jain, Vinay Sharma | 2026 | ≈ 76%
- Modeling Human Behavior Part I -- Learning and Belief Approaches | Andrew Fuchs and Andrea Passarella and Marco Conti | 2022 | ≈ 76%
- Perceptions of Linguistic Uncertainty by Language Models and Humans | Markelle Kelly, Mark Steyvers, Sameer Singh, Padhraic Smyth, Catarina G Belem | 2024 | ≈ 75%
- Active inference and artificial reasoning | Lancelot Da Costa, Alexander Tschantz, Conor Heins, Christopher Buckley, Tim Verbelen, Thomas Parr, Karl Friston | 2025 | ≈ 75%
- Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models | Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi, Ayush Rajesh Jhaveri | 2026 | ≈ 74%
- When Should an AI Act? A Human-Centered Model of Scene, Context, and Behavior for Agentic AI Design | Daehoo Yoon, Sung Gyu Koh, Young Hwan Kim, Yehan Ahn, Sung Park, Soyoung Jung | 2026 | ≈ 74%
- Interactive inference: a multi-agent model of cooperative joint actions | Francesco Donnarumma, Giovanni Pezzulo, Domenico Maisto | 2024 | ≈ 74%
- An Active Inference Strategy for Prompting Reliable Responses from Large Language Models in Medical Practice | Allison C. Waters, Shannon O'Neill, Phan Luu and Don M. Tucker, Roma Shusterman | 2024 | ≈ 74%
- Language and Experience: A Computational Model of Social Learning in Complex Tasks | Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman, Jacob Andreas, Joshua Tenenbaum, Cédric Colas | 2026 | ≈ 74%
- Belief Attribution as Mental Explanation: The Role of Accuracy, Informativity, and Causality | Almog Hillel, Ryan Truong, Vikash K. Mansinghka, Joshua B. Tenenbaum, Tan Zhi-Xuan, Lance Ying | 2025 | ≈ 74%
- Causal Evidence that Language Models use Confidence to Drive Behavior | Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean, Dharshan Kumaran | 2026 | ≈ 74%
- Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models | Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl, Xingyuan Zhang | 2023 | ≈ 74%
- Explanations are a Means to an End: Decision Theoretic Explanation Evaluation | Berk Ustun, Jessica Hullman, Ziyang Guo | 2026 | ≈ 73%
- Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes | Yue Zhang, Jinku Li, Rui Jiao | 2025 | ≈ 73%
- Reasoning Models Generate Societies of Thought | Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans, Junsol Kim | 2026 | ≈ 73%
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models | in corpus | 2025 | ≈ 72%
- Active Inference, Curiosity and Insight | in corpus | 2017 | ≈ 71%
- Active inference: demystified and compared | in corpus | 2021 | ≈ 71%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | in corpus | 2023 | ≈ 69%
- Verbalized Eval Awareness Inflates Measured Safety | in corpus | 2026 | ≈ 68%
- Testing the Limits of Truth Directions in LLMs | in corpus | 2026 | ≈ 68%
- Interpreting Language Model Parameters | in corpus | 2026 | ≈ 68%
- Alignment faking in large language models | in corpus | 2024 | ≈ 68%
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs | in corpus | 2025 | ≈ 67%
Similar preprints — Semantic Scholar
Cited by (7)
- Testing the Limits of Truth Directions in LLMs
Linear truth directions in LLMs are reliable primarily for simple factual retrieval and break down as soon as truth assessment requires tracking intermediate results—a finding that sharply constrains
- pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself—rather than model surgery code—as the primitive abstractio
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and cau
- Endogenous Resistance to Activation Steering in Language Models
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie