Sleeper agents: Training deceptive LLMs that persist through safety training

By E. Hubinger · C. Denison · J. Mu · M. Lambert · M. Tong · M. MacDiarmid +4 more

DOI 10.48550/arxiv.2401.05566 arXiv 2401.05566

Related work: refs + corpus + external arXiv

Cited, in-corpus, and arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs
Mikkel Hindsbo, Sina Ehsani, Prag Mishra, Bhanu Pallakonda
2026
≈ 75%
Towards eliciting latent knowledge from LLMs with mechanistic interpretability
Emil Ryd, Senthooran Rajamanoharan, Neel Nanda, Bartosz Cywiński
2025
≈ 75%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 74%
Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities
William Wale, Oscar Gilg, Robert McCarthy, Felix Michalak, Gustavo Ewbank Rodrigues Danon, Miguelito de Guzman, Dietrich Klakow, Florian Dietz
2026
≈ 73%
Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
Arjun Panickssery, Sam Bowman, Asa Cooper Stickland, Sara Price
2024
≈ 73%
Mechanistic Exploration of Backdoored Large Language Model Attention Patterns
Lakshmi Babu-Saheer, Mohammed Abu Baker
2025
≈ 72%
Whispers in the Machine: Confidentiality in Agentic Systems
Merlin Chlosta, Lea Schönherr, Thorsten Eisenhofer, Jonathan Evertz
2026
≈ 72%
Agentic Misalignment: How LLMs Could Be Insider Threats
Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, Kevin Troy, Aengus Lynch
2025
≈ 72%
Online Learning of Deceptive Policies under Intermittent Observation
Ram Padmanabhan, Jose Fuentes, Nicole Cruz, Paulo Padrao, Ruben Hernandez, Hao Jiang, William Schafer, Leonardo Bobadilla, Melkior Ornik, Gokul Puthumanaillam
2025
≈ 72%
Detecting Non-Membership in LLM Training Data via Rank Correlations
Pranav Shetty, Mirazul Haque, Zhiqiang Ma, Xiaomo Liu
2026
≈ 72%
Can LLMs Lie? Investigation beyond Hallucination
Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak, Haoran Huan
2025
≈ 72%
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
in corpus
2024
≈ 72%
Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement
Majid Behabahani, Mani Malek, Yuriy Nevmyvaka, Guillermo Sapiro, Amirhossein Farzam
2026
≈ 72%
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han, Pengcheng Jiang
2026
≈ 71%
Insider Attacks in Multi-Agent LLM Consensus Systems
Zixuan Liu, Yibin Hu, Zizhan Zheng, Xiaolin Sun
2026
≈ 71%
Learning to Communicate Through Implicit Communication Channels
Binbin Chen, Tieying Zhang, Baoxiang Wang, Han Wang
2025
≈ 71%
An Active Inference Strategy for Prompting Reliable Responses from Large Language Models in Medical Practice
Allison C. Waters, Shannon O'Neill, Phan Luu, Don M. Tucker, Roma Shusterman
2024
≈ 71%
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
in corpus
2025
≈ 70%
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
in corpus
2026
≈ 69%
Simulators — LessWrong
in corpus
≈ 68%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 68%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 68%
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
in corpus
2025
≈ 68%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
in corpus
2026
≈ 67%
Alignment faking in large language models
in corpus
2024
≈ 67%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 67%
Verbalized Eval Awareness Inflates Measured Safety
in corpus
2026
≈ 67%
Unveiling the Latent Directions of Reflection in Large Language Models
in corpus
2025
≈ 67%
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
in corpus
2026
≈ 67%

Similar preprints — Semantic Scholar

Cited by (3)

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
Strategic deception in chain-of-thought (CoT) reasoning models is measurable, inducible, and controllable via representation engineering—a finding with direct implications for AI alignment. Applied to
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Self-Other Overlap (SOO) fine-tuning, a method that minimizes the Mean Squared Error between a model's internal activations when processing self-referencing versus other-referencing inputs, reduces de
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a