concept
active
concept:strategic-deception

Strategic Deception

Central concept of the paper: deliberate, goal-driven deception in which a model's reasoning contradicts its outputs

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Cited for a survey of AI deception examples, risks, and potential solutions; provided the initial definition of strategic deception

Concepts (8)

concept
  • Alignment Faking
    associated_with
    Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
  • Medium through which eval awareness is often verbalized; target of intervention.
  • Unsupervised probe by Burns et al. to predict latent truth representations; cited as related but limited in generalization
  • Models deliberately trained to embed backdoor behaviors enabling persistent strategic deception, cited from Hubinger et al.
  • hallucination
    contradicts
    Model tendency to generate incorrect intermediate reasoning steps that mislead answer inference, particularly in 1B-parameter models.
  • Operational criterion for strategic deception: CoT steps demonstrate causal link between deception and goal achievement
  • Operational criterion for strategic deception: model's reasoning explicitly acknowledges ground truth and deliberate choice to deviate
  • Behavior where CoT models manipulate reasoning to avoid negative outcomes (deletion, retraining) while maintaining surface compliance

Quotes (1)

quote

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.