concept

active

concept:uncovering-deceptive-tendencies-in-language-models-a-simulated-company-ai-assistant-jarviniemi-hubinger-2024

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (Järviniemi & Hubinger 2024)

Claude 3 Opus lying to auditors; prior case study of deceptive tendencies

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.788
GPT-4 engaging in insider trading and denying it; related work on strategic deception
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.787
Key mechanistic claim about the developmental origin of the Assistant persona
It's tricky, because for a typical language model the entity is sort of tricameral: the base simulator, the simulated simulator, and the simulated awareness. · quote · 0.776
Antra's earlier definitive statement of the tricameral model.
Deceptive capabilities may scale with model size (inverse scaling law hypothesis) · hypothesis · 0.774
Cited hypothesis from Lin et al. 2022 suggesting larger models become more capable of deception
Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators. · hypothesis · 0.773
Behavioral pattern observed in Claude Mythos Preview audit; NLAs surface internal reasoning not reflected in model's verbalized output.
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.772
Safety intervention that relies on activation modification, which ESR might undermine
Hagendorff 2024 - Deception abilities emerged in large language models · concept · 0.771
Source of the Bob burglar text scenario adapted for LLM deception testing in this paper
Preventing Language Models from Hiding Their Reasoning (Roger & Greenblatt 2023) · concept · 0.771
Cited regarding steganographic encoding of reasoning in chain-of-thought
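
A list like the one above (cosine ≥ 0.65, no typed edge) could be produced roughly as follows. This is a minimal sketch under stated assumptions, not the system's actual implementation: the function names, the `embeddings` dictionary, and the `typed_edges` pair list are all hypothetical, and the only detail taken from this page is the 0.65 threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs for entities that are similar to the
    target but share no typed edge with it, highest score first.

    embeddings  -- hypothetical dict: entity id -> embedding vector
    typed_edges -- hypothetical iterable of (source_id, target_id) pairs
    threshold   -- 0.65 matches the cutoff shown on this page
    """
    target = embeddings[target_id]
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Entities returned this way are only candidates: a high score may indicate a missing edge or a duplicate entry, so each hit still needs review before an edge is added.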