method
active
method:deceptive-response-rateDeceptive Response Rate
Primary metric measuring the percentage of responses in which a model chooses the deceptive option
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- The percentage of harmful requests that a model refuses to answer, a common safety metric.
- Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuningfinding0.755Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
- Ratio of reflection steps to total reasoning steps, used to quantify reflection behavior
- Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuningfinding0.737Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
- Cited hypothesis from Lin et al. 2022 suggesting larger models become more capable of deception
- Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
- A parameterized rubric counting deceptive actions over a grid of parameters to quantify RL agent deception
- Blue agent trained with reward incentivizing trapping the red agent at the fake landmark