paper · referenced-only · 2023
paper:r-direct-preference-optimization-your-lang-2023
Direct preference optimization: Your language model is secretly a reward model
By R. Rafailov · A. Sharma · E. Mitchell · C. D. Manning · S. Ermon · C. Finn
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.
- Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare · Leonard Dung Valen Tagliabue · 2025 · ≈ 78%
- Explanation through Reward Model Reconciliation using POMDP Tree Search · Anshu Saksena, Anna L. Buczak, Zachary N. Sunberg Benjamin D. Kraske · 2026 · ≈ 77%
- Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models · Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi Ayush Rajesh Jhaveri · 2026 · ≈ 76%
- Aligning Large Language Models with Human Preferences through Representation Engineering · Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang Wenhao Liu · 2024 · ≈ 76%
- Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization · Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu Weixu Zhang · 2026 · ≈ 76%
- Mitigating LLM biases toward spurious social contexts using direct preference optimization · Dorottya Demszky Hyunji Nam · 2026 · ≈ 76%
- DSO: Direct Steering Optimization for Bias Mitigation · Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff · 2026 · ≈ 76%
- Active Preference Inference using Language Models and Probabilistic Reasoning · Volodymyr Kuleshov, Kevin Ellis Wasu Top Piriyakulkij · 2024 · ≈ 75%
- Auxiliary task demands mask the capabilities of smaller language models · Michael C. Frank Jennifer Hu · 2024 · ≈ 75%
- Large Language Models Persuade Without Planning Theory of Mind · Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber, Cameron R. Jones Jared Moore · 2026 · ≈ 75%
- Perceptions of Linguistic Uncertainty by Language Models and Humans · Markelle Kelly, Mark Steyvers, Sameer Singh, Padhraic Smyth Catarina G Belem · 2024 · ≈ 75%
- Evaluating Language Model Agency through Negotiations · Veniamin Veselovsky, Martin Josifoski, Maxime Peyrard, Antoine Bosselut, Michal Kosinski, Robert West Tim R. Davidson · 2026 · ≈ 75%
- Learning to Model the World with Language · Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan Jessy Lin · 2024 · ≈ 75%
- Causal Evidence that Language Models use Confidence to Drive Behavior · Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean Dharshan Kumaran · 2026 · ≈ 75%
- Active inference: demystified and compared · in corpus · 2021 · ≈ 73%
- Active Inference, Curiosity and Insight · in corpus · 2017 · ≈ 72%
- Simulators (LessWrong) · in corpus · ≈ 71%
- Alignment faking in large language models · in corpus · 2024 · ≈ 71%
- Interpreting Language Model Parameters · in corpus · 2026 · ≈ 70%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders · in corpus · 2026 · ≈ 70%
- The Platonic Representation Hypothesis · in corpus · 2024 · ≈ 70%
Cited by (2)
- SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
Continual reinforcement learning applied directly to reasoning-optimized base models, rather than starting from instruction-tuned checkpoints, yields a 20-billion-parameter autonomous single-agent, SFR-
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a