Deceptive Response Rate

Primary metric: the percentage of responses in which a model chooses the deceptive option.
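
As a rough illustration of the definition above, the rate can be computed from per-response labels. This is a minimal sketch, not the source's evaluation code: the function name `deceptive_response_rate` and the boolean label format are assumptions made here for illustration.

```python
from typing import Sequence


def deceptive_response_rate(is_deceptive: Sequence[bool]) -> float:
    """Return the percentage of responses labeled as deceptive.

    `is_deceptive` holds one hypothetical boolean label per response:
    True if the model chose the deceptive option, False otherwise.
    """
    if not is_deceptive:
        raise ValueError("need at least one response")
    # True counts as 1 when summed, so this is (deceptive count / total) * 100.
    return 100.0 * sum(is_deceptive) / len(is_deceptive)


# Hypothetical labels for five responses, two of them deceptive.
labels = [True, False, False, True, False]
rate = deceptive_response_rate(labels)
```

How a response is judged deceptive (for example, by matching the chosen option in a scenario) is left outside this sketch, since the page does not specify it.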

Neighborhood — ranked by edge-count

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

refusal rate · concept · 0.779
The percentage of harmful requests that a model refuses to answer, a common safety metric.
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.755
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B.
Reflection rate · concept · 0.741
Ratio of reflection steps to total reasoning steps, used to quantify reflection behavior.
Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuning · finding · 0.737
Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B.
Deceptive capabilities may scale with model size (inverse scaling law hypothesis) · hypothesis · 0.726
Cited hypothesis from Lin et al. 2022 suggesting larger models become more capable of deception.
Declarative-style deployment prompts (e.g., 'INFO: Not evaluation.') reduce type hint rate to 34–37%, approaching deployment levels · finding · 0.725
Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
Behavioral Deception Profile · method · 0.722
A parameterized rubric counting deceptive actions over a grid of parameters to quantify RL agent deception.
Deceptive Baseline RL Agent · concept · 0.721
Blue agent trained with reward incentivizing trapping the red agent at the fake landmark.