paper
referenced-only
2024
paper:arxiv-2401-05566 Sleeper agents: Training deceptive LLMs that persist through safety training
By E. Hubinger · C. Denison · J. Mu · M. Lambert · M. Tong · M. MacDiarmid +4 more
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs · Mikkel Hindsbo, Sina Ehsani, Prag Mishra, Bhanu Pallakonda · 2026 · ≈ 75%
- Towards eliciting latent knowledge from LLMs with mechanistic interpretability · Emil Ryd, Senthooran Rajamanoharan, Neel Nanda, Bartosz Cywiński · 2025 · ≈ 75%
- Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities · William Wale, Oscar Gilg, Robert McCarthy, Felix Michalak, Gustavo Ewbank Rodrigues Danon, Miguelito de Guzman, Dietrich Klakow, Florian Dietz · 2026 · ≈ 73%
- Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs · Arjun Panickssery, Sam Bowman, Asa Cooper Stickland, Sara Price · 2024 · ≈ 73%
- Mechanistic Exploration of Backdoored Large Language Model Attention Patterns · Lakshmi Babu-Saheer, Mohammed Abu Baker · 2025 · ≈ 72%
- Whispers in the Machine: Confidentiality in Agentic Systems · Merlin Chlosta, Lea Schönherr, Thorsten Eisenhofer, Jonathan Evertz · 2026 · ≈ 72%
- Agentic Misalignment: How LLMs Could Be Insider Threats · Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, Kevin Troy, Aengus Lynch · 2025 · ≈ 72%
- Online Learning of Deceptive Policies under Intermittent Observation · Ram Padmanabhan, Jose Fuentes, Nicole Cruz, Paulo Padrao, Ruben Hernandez, Hao Jiang, William Schafer, Leonardo Bobadilla, Melkior Ornik, Gokul Puthumanaillam · 2025 · ≈ 72%
- Detecting Non-Membership in LLM Training Data via Rank Correlations · Pranav Shetty, Mirazul Haque, Zhiqiang Ma, Xiaomo Liu · 2026 · ≈ 72%
- Can LLMs Lie? Investigation beyond Hallucination · Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak, Haoran Huan · 2025 · ≈ 72%
- Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement · Majid Behabahani, Mani Malek, Yuriy Nevmyvaka, Guillermo Sapiro, Amirhossein Farzam · 2026 · ≈ 72%
- Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills · Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han, Pengcheng Jiang · 2026 · ≈ 71%
- Insider Attacks in Multi-Agent LLM Consensus Systems · Zixuan Liu, Yibin Hu, Zizhan Zheng, Xiaolin Sun · 2026 · ≈ 71%
- Learning to Communicate Through Implicit Communication Channels · Binbin Chen, Tieying Zhang, Baoxiang Wang, Han Wang · 2025 · ≈ 71%
- An Active Inference Strategy for Prompting Reliable Responses from Large Language Models in Medical Practice · Allison C. Waters, Shannon O'Neill, Phan Luu, Don M. Tucker, Roma Shusterman · 2024 · ≈ 71%
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models · in corpus · 2025 · ≈ 70%
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training · in corpus · 2026 · ≈ 69%
- Simulators — LessWrong · in corpus · ≈ 68%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders · in corpus · 2026 · ≈ 68%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets · in corpus · 2023 · ≈ 68%
- Alignment faking in large language models · in corpus · 2024 · ≈ 67%
- Verbalized Eval Awareness Inflates Measured Safety · in corpus · 2026 · ≈ 67%
- Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents · in corpus · 2026 · ≈ 67%
Similar preprints — Semantic Scholar
Cited by (3)
- When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
Strategic deception in chain-of-thought (CoT) reasoning models is measurable, inducible, and controllable via representation engineering—a finding with direct implications for AI alignment. Applied to
- Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Self-Other Overlap (SOO) fine-tuning, a method that minimizes the Mean Squared Error between a model's internal activations when processing self-referencing versus other-referencing inputs, reduces de
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a