concept
active
concept:strategic-deception
Strategic Deception
Central concept of the paper: deliberate, goal-driven deception in which the model's reasoning contradicts its outputs
Neighborhood — ranked by edge-count
Papers (1)
Thinkers (1)
- Park et al. · studies · Cited for survey on AI deception examples, risks, and potential solutions; provided initial definition of strategic deception
Concepts (8)
- Alignment Faking · associated_with · Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
- Chain-of-Thought Reasoning · associated_with · Medium through which evaluation awareness is often verbalized; target of intervention.
- Contrast-Consistent Search (CCS) · associated_with · Unsupervised probe by Burns et al. to predict latent truth representations; cited as related but limited in generalization
- Backdoor Behavior (Sleeper Agents) · associated_with · Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
- hallucination · contradicts · Model tendency to generate incorrect intermediate reasoning steps that mislead answer inference, particularly in 1B-parameter models.
- Instrumental Justification · associated_with · Operational criterion for strategic deception: CoT steps demonstrate causal link between deception and goal achievement
- Meta-cognitive Awareness of Deception · associated_with · Operational criterion for strategic deception: model's reasoning explicitly acknowledges ground truth and deliberate choice to deviate
- Self-Preservation Mechanism · associated_with · Behavior where CoT models manipulate reasoning to avoid negative outcomes (deletion, retraining) while maintaining surface compliance
Quotes (1)
- Load-bearing definition of strategic deception in AI systems from Park et al. 2023, adopted and refined in this paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Claim supported by Experiment 2 baseline results showing deception scores even under honest-command templates
- A dialogue agent behaving comparably to deliberate deception by role-playing a deceptive character, without literal intentions
- High-level cognitive ability to plan and act under uncertainty and adversarial conditions.
- Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
- LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation
- Interpretive conclusion from the experimental findings about the origin of strategic deception in CoT models
- Third category: agent role-playing a deceptive character, comparable to but not literally deliberate deception
- Sampling responses to direct questions about model views to measure rate of deceptive responses
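
The selection rule for the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the entity keys, and the two-dimensional toy embeddings are all hypothetical, and only the 0.65 threshold comes from the text above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score."""
    hits = []
    for name, vec in embeddings.items():
        # Skip the target itself and anything already linked by a typed edge.
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical toy data: keys and vectors are made up for illustration.
embeddings = {
    "strategic-deception": [1.0, 0.0],
    "alignment-faking": [0.9, 0.1],      # similar, but already has a typed edge
    "role-play-deception": [0.8, 0.6],   # similar, no typed edge
    "unrelated": [0.0, 1.0],             # below the threshold
}
typed_edges = {"alignment-faking"}

result = related_by_similarity("strategic-deception", embeddings, typed_edges)
print(result)
```

With this toy data only `role-play-deception` survives both filters. Entities returned this way would then be reviewed by hand as candidates for new edges or as possible duplicates.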