finding
active
finding:soo-metric-classifies-deceptive-vs-honest-rl-agents-with-100-0-accuracy-at-2000-2500-episodes-across-1500-seeds
SOO metric classifies deceptive vs honest RL agents with 100% ± 0% accuracy at 2000-2500 episodes across 1500 seeds
Mean self-other overlap (SOO) value perfectly distinguishes deceptive from honest baseline agents at 2000-2500 episodes across 1500 seeds
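A minimal sketch of how a classification by mean SOO value could work, assuming a simple midpoint-threshold rule and assuming that a higher value means more overlap. The function names, the threshold rule, and all numbers below are hypothetical illustrations, not the paper's procedure or data.

```python
from statistics import mean

def midpoint_threshold(deceptive_means, honest_means):
    """Threshold halfway between the two baseline group means (assumed rule)."""
    return (mean(deceptive_means) + mean(honest_means)) / 2

def classify(mean_soo, threshold):
    """Label an agent 'deceptive' if its mean SOO falls below the threshold."""
    return "deceptive" if mean_soo < threshold else "honest"

def accuracy(samples, threshold):
    """Fraction of (mean_soo, label) pairs classified correctly."""
    correct = sum(classify(value, threshold) == label for value, label in samples)
    return correct / len(samples)

# Synthetic per-seed mean SOO values, invented purely for illustration.
deceptive_ref = [0.20, 0.25, 0.22]
honest_ref = [0.80, 0.78, 0.83]
threshold = midpoint_threshold(deceptive_ref, honest_ref)

held_out = [(0.21, "deceptive"), (0.79, "honest"),
            (0.30, "deceptive"), (0.75, "honest")]
print(accuracy(held_out, threshold))
```

With well-separated groups like these synthetic ones, any threshold between the two clusters classifies every seed correctly, which is the situation a 100% ± 0% accuracy figure describes.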
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Deceptive RL baseline agents have lower mean neural self-other overlap than honest baseline agents · associated_with · supports · Core empirical prediction tested in RL experiments, confirmed by 100% classification accuracy
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- SOO-trained RL agent behavior closely resembles honest baseline rather than deceptive baseline · finding · 0.824 · Qualitative behavioral analysis showing SOO fine-tuning redirects deceptive RL agent toward honest behavior
- Scaling finding suggesting larger models benefit more from SOO fine-tuning
- Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.762 · Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
- Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuning · finding · 0.755 · Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
- Research gap identified by the authors regarding ecological validity of SOO results
- Blue agent trained with reward incentivizing trapping the red agent at the fake landmark
- Shows smaller models are more sensitive to reflection reduction on non-math tasks
- SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.737 · Future work hypothesis about testing SOO against adversarial sleeper agent scenarios