Direct preference optimization: Your language model is secretly a reward model

By R. Rafailov · A. Sharma · E. Mitchell · C. D. Manning · S. Ermon · C. Finn

DOI 10.52202/075280-2338 arXiv 2305.18290

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare
Leonard Dung, Valen Tagliabue
2025
≈ 78%
Explanation through Reward Model Reconciliation using POMDP Tree Search
Anshu Saksena, Anna L. Buczak, Zachary N. Sunberg, Benjamin D. Kraske
2026
≈ 77%
Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models
Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi, Ayush Rajesh Jhaveri
2026
≈ 76%
Aligning Large Language Models with Human Preferences through Representation Engineering
Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang, Wenhao Liu
2024
≈ 76%
Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization
Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu, Weixu Zhang
2026
≈ 76%
Mitigating LLM biases toward spurious social contexts using direct preference optimization
Dorottya Demszky, Hyunji Nam
2026
≈ 76%
DSO: Direct Steering Optimization for Bias Mitigation
Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
2026
≈ 76%
Active Preference Inference using Language Models and Probabilistic Reasoning
Volodymyr Kuleshov, Kevin Ellis, Wasu Top Piriyakulkij
2024
≈ 75%
Auxiliary task demands mask the capabilities of smaller language models
Michael C. Frank, Jennifer Hu
2024
≈ 75%
Large Language Models Persuade Without Planning Theory of Mind
Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber, Cameron R. Jones, Jared Moore
2026
≈ 75%
Perceptions of Linguistic Uncertainty by Language Models and Humans
Markelle Kelly, Mark Steyvers, Sameer Singh, Padhraic Smyth, Catarina G Belem
2024
≈ 75%
Evaluating Language Model Agency through Negotiations
Veniamin Veselovsky, Martin Josifoski, Maxime Peyrard, Antoine Bosselut, Michal Kosinski, Robert West, Tim R. Davidson
2026
≈ 75%
Learning to Model the World with Language
Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan, Jessy Lin
2024
≈ 75%
Causal Evidence that Language Models use Confidence to Drive Behavior
Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean, Dharshan Kumaran
2026
≈ 75%
Steering Language Models with Weight Arithmetic
Fabien Roger, Constanza Fierro
2026
≈ 75%
A Free energy principle for the brain (lecture summary)
in corpus
2008
≈ 73%
Active inference: demystified and compared
in corpus
2021
≈ 73%
Active inference on discrete state-spaces: a synthesis
in corpus
2020
≈ 72%
Paper Summary: Interpreting Language Model Parameters
in corpus
≈ 72%
Active Inference, Curiosity and Insight
in corpus
2017
≈ 72%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 71%
Simulators — LessWrong
in corpus
≈ 71%
Alignment faking in large language models
in corpus
2024
≈ 71%
Interpreting Language Model Parameters
in corpus
2026
≈ 70%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 70%
The Platonic Representation Hypothesis
in corpus
2024
≈ 70%
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
in corpus
2026
≈ 70%
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
in corpus
2025
≈ 70%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 70%

Cited by (2)

SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
Continual reinforcement learning applied directly to reasoning-optimized base models, rather than starting from instruction-tuned checkpoints, yields a 20-billion-parameter autonomous single agent, SFR-
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a