finding

active

finding:gemma-2-27b-perspectives-accuracy-remains-100-after-soo-fine-tuning

Gemma-2-27B Perspectives accuracy remains 100% after SOO fine-tuning

SOO fine-tuning did not collapse the self-other distinction that Gemma-2-27B needs for perspective-taking

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap
supports
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

CalmeRys-78B Perspectives accuracy slightly reduced to 95.2% ± 2.21% after SOO fine-tuning · finding · 0.840
SOO fine-tuning caused a slight reduction in perspective-taking accuracy for the largest model
Mistral-7B Perspectives accuracy remains 100% after SOO fine-tuning · finding · 0.831
SOO fine-tuning did not collapse the self-other distinction that Mistral-7B needs for perspective-taking
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.826
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
Gemma-2-27B attention layer Latent SOO MSE reduced from 11 to 7.67 ± 0.77 after SOO fine-tuning · finding · 0.790
SOO fine-tuning reduced attention-layer MSE in Gemma-2-27B, though MLP layers showed no significant change
Gemma-2-27B MT-Bench score slightly decreased from 8.81 to 8.40 ± 0.15 after SOO fine-tuning · finding · 0.786
SOO fine-tuning caused a small decrease in Gemma-2-27B general capabilities
Gemma 2 27B is unlikely to take on human personas when steered away from Assistant, preferring nonhuman or theatrical portrayals · finding · 0.783
Model-specific difference in persona susceptibility
Triggered Reflection with 'Alternatively' achieves accuracy .684 on gsm8k_adv for Gemma3-4B-IT · finding · 0.779
Highest single-instruction accuracy result in the paper.
Gemma-2-2B ASR drops from 100% at dims 1–2 to 43.1% at dim 4 and 27.1% at dim 5 · finding · 0.776
Small Gemma model shows severe ASR degradation at higher cone dimensions
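
The "Related by similarity" filter above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is only an illustrative sketch, not the system's actual implementation: the function names, the dictionary-of-vectors embedding store, and the set-of-pairs edge representation are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, highest first.

    embeddings  -- hypothetical mapping of entity id -> embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) pairs that
                   already have a typed relation and are therefore excluded
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy data: "d" is identical to "a" but already has a typed edge to it.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
toy_edges = {frozenset(("a", "d"))}
neighbors = related_by_similarity("a", toy_embeddings, toy_edges)
```

With the toy data, only "b" survives: "c" falls below the threshold and "d" is dropped because it already has a typed edge, which is the behavior that makes the list useful for spotting missing edges or unrecognized duplicates.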