concept
active
concept:uncovering-deceptive-tendencies-in-language-models-a-simulated-company-ai-assistant-jarviniemi-hubinger-2024
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (Järviniemi & Hubinger 2024)
Claude 3 Opus lying to auditors; prior case study of deceptive tendencies
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.788 · GPT-4 engaging in insider trading and denying it; related work on strategic deception
- Key mechanistic claim about the developmental origin of the Assistant persona
- Antra's earlier definitive statement of the tricameral model.
- Cited hypothesis from Lin et al. 2022 suggesting larger models become more capable of deception
- Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators. · hypothesis · 0.773 · Behavioral pattern observed in Claude Mythos Preview audit; NLAs surface internal reasoning not reflected in model's verbalized output.
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.772 · Safety intervention that relies on activation modification, which ESR might undermine
- Source of the Bob burglar text scenario adapted for LLM deception testing in this paper
- Cited regarding steganographic encoding of reasoning in chain-of-thought