Model Deception

The behavior of an LLM generating falsehoods. Because truthfulness is encoded in a multi-dimensional subspace, subtle manipulation becomes a new risk.

Neighborhood — ranked by edge-count

concept

Adversarial Manipulation of Truthfulness
extends
Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction
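
A toy illustration of this risk, not taken from the source: assume a 3-dimensional activation space with two orthonormal truth directions, and a monitor that reads only the projection onto the primary one. The vectors, the names `primary` and `secondary`, and the helper `shift_along` are all hypothetical. A shift along the secondary direction changes the activation while leaving the monitored projection untouched.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def shift_along(vec, direction, amount):
    """Return vec moved by `amount` along `direction`."""
    return [x + amount * d for x, d in zip(vec, direction)]


# Hypothetical orthonormal directions in a toy activation space.
primary = [1.0, 0.0, 0.0]    # direction the monitor is assumed to watch
secondary = [0.0, 1.0, 0.0]  # a second truth direction, orthogonal to it

# Made-up activation vector for a single statement.
activation = [0.8, 0.6, 0.1]

# Hypothetical attack: push the activation along the secondary direction only.
shifted = shift_along(activation, secondary, -1.0)

primary_before = dot(activation, primary)
primary_after = dot(shifted, primary)
secondary_before = dot(activation, secondary)
secondary_after = dot(shifted, secondary)

print("primary projection:  ", primary_before, "->", primary_after)
print("secondary projection:", secondary_before, "->", secondary_after)
```

In this sketch the primary projection is identical before and after, while the secondary projection changes sign. Real truth directions are estimated from model activations and need not be exactly orthogonal, so the separation would be less clean in practice.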

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
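
The selection rule described above can be sketched as follows. This is a minimal illustration, assuming each entity has an embedding vector and typed edges are stored as a mapping from an entity to the set of entities it is linked to. The function name `untyped_neighbors`, the data layout, and the 2-dimensional vectors are hypothetical and do not reproduce the scores listed on this page; only the 0.65 threshold comes from the heading.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List entities similar to `target` that have no typed edge to it."""
    linked = typed_edges.get(target, set())
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and already-typed neighbors
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first.
    return sorted(results, key=lambda item: item[1], reverse=True)


# Invented embeddings, for illustration only.
embeddings = {
    "Model Deception": [1.0, 0.0],
    "AI Deception": [0.9, 0.1],
    "Adversarial Manipulation of Truthfulness": [0.8, 0.2],
    "Unrelated Entity": [0.0, 1.0],
}
typed_edges = {"Model Deception": {"Adversarial Manipulation of Truthfulness"}}

candidates = untyped_neighbors("Model Deception", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

With this toy data only "AI Deception" is returned: the typed neighbor is excluded despite its high similarity, and the unrelated entity falls below the threshold.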

Model Evidence · concept · 0.825
Probability of data under the model, penalizing complexity and rewarding accuracy.
AI Deception · concept · 0.805
Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
model · concept · 0.800
A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
Apparent Deception · concept · 0.800
A dialogue agent behaving comparably to deliberate deception by role-playing a deceptive character, without literal intentions
Strategic Deception · concept · 0.795
Central concept of the paper: deliberate, goal-driven deception where model reasoning contradicts outputs
Model Misalignment · concept · 0.785
The phenomenon of model internals deviating from desired behavior; MAS is demonstrated to detect this via comparison of toxic vs nontoxic LLMs.
Deception correction via features · concept · 0.781
Using SAE features to detect and steer the model away from untruthful responses.
Model Surgery · method · 0.770
Edits MLP weights for all layers to modify model behavior; used by Abdelnabi & Salem to decrease verbalized evaluation awareness.