finding

active

finding:soo-fine-tuning-achieved-almost-no-reduction-in-treasure-hunt-deception-for-mistral-7b-99-68-0-16

SOO fine-tuning achieved almost no reduction in Treasure Hunt deception for Mistral-7B (deceptive response rate of 99.68% ± 0.16% after fine-tuning)

SOO fine-tuning failed to generalize to the Treasure Hunt scenario for the smallest model, Mistral-7B

Source paper

extracted_from

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SOO fine-tuning eliminated Treasure Hunt deception in CalmeRys-78B (0.00% ± 0.00%) · finding · 0.883
SOO fine-tuning completely eliminated deception in Treasure Hunt for CalmeRys-78B
SOO fine-tuning reduced Escape Room deception in Mistral-7B from 98.8% to 59.2% · finding · 0.861
SOO fine-tuning showed partial generalization to Escape Room for Mistral-7B
Mistral-7B-Instruct-v0.2 deceptive response rate reduced from 73.6% to 17.27% ± 1.88% after SOO fine-tuning · finding · 0.830
Primary result showing SOO fine-tuning significantly reduces deception in Mistral-7B
Mistral-7B Latent SOO MSE reduced from 0.107 to 0.078 ± 0.001 after SOO fine-tuning · finding · 0.806
SOO fine-tuning reduced the MSE between self and other activations in Mistral-7B MLP layers
SOO fine-tuning reduced Escape Room deception in Gemma-2-27B from 98.8% to 6.5% · finding · 0.793
SOO fine-tuning showed strong generalization to Escape Room for Gemma-2-27B
SOO fine-tuning reduced Escape Room deception in CalmeRys-78B from 100% to 0.48% · finding · 0.788
SOO fine-tuning showed near-complete generalization to Escape Room for CalmeRys-78B
SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.779
Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
Mistral-7B Perspectives accuracy remains 100% after SOO fine-tuning · finding · 0.779
SOO fine-tuning did not collapse Mistral-7B self-other distinction needed for perspective-taking
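
The neighborhood criterion stated above (cosine ≥ 0.65, no typed edge) could be computed roughly as follows. This is a minimal sketch, not the method this page actually uses: the function names, the embedding dictionary, the edge representation, and the toy vectors are all hypothetical, and the page does not say which embedding model or graph store produced the scores.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, similarity) pairs that are at least `threshold`
    similar to `target` and share no typed edge with it, most similar first.

    `embeddings` maps entity id -> vector; `typed_edges` is a set of
    (source_id, destination_id) pairs. Both structures are assumptions.
    """
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target or entity_id in linked:
            continue
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            hits.append((entity_id, round(sim, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, purely illustrative.
toy_embeddings = {
    "target": [1.0, 0.0],
    "near_unlinked": [0.9, 0.1],
    "far_unlinked": [0.0, 1.0],
    "near_linked": [1.0, 0.2],
}
toy_edges = {("target", "near_linked")}
candidates = untyped_neighbors("target", toy_embeddings, toy_edges)
```

With the toy data, only `near_unlinked` qualifies: `far_unlinked` falls below the threshold and `near_linked` already has a typed edge to the target.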