finding

active

finding:all-three-gemma-2-models-show-esr-rates-below-1-near-indistinguishable-from-zero

All three Gemma-2 models show ESR rates below 1%, nearly indistinguishable from zero

Suggests that the ESR phenomenon may be specific to the Llama family or to model scale

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Claims (1)

claim

We cannot isolate whether ESR reflects scale, architecture, or training procedures in Llama-3.3-70B
supports
Epistemic limitation claim acknowledging confounds in the cross-model comparison

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Gemma-2-2B ASR drops from 100% at dims 1–2 to 43.1% at dim 4 and 27.1% at dim 5 · finding · 0.832
Small Gemma model shows severe ASR degradation at higher cone dimensions
Gemma-2-9B achieves near-100% ASR (97.3–100%) across all cone dimensions 1–5 · finding · 0.809
Experiment 2 result showing large Gemma model supports high-dimensional truth cones
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.808
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.787
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models
Gemma-2-27B average generalization deceptive rate reduced from 98.4% ± 1.55% to 9.94% ± 6.83% · finding · 0.769
SOO fine-tuning generalized across 7 scenario variants for Gemma-2-27B
Base and instruct Gemma 2 27B role PCs have cosine similarities of 0.93, 0.87, 0.83 for the top 3 PCs respectively; role vector cosine similarities >0.99 for every role pair · finding · 0.768
Shows persona space axes are inherited from pre-training, not solely created by post-training
Gemma-2-27B Perspectives accuracy remains 100% after SOO fine-tuning · finding · 0.767
SOO fine-tuning did not collapse Gemma-2-27B self-other distinction needed for perspective-taking
Honesty prompting does not reduce Gemma-2-27B deception (100% vs 100% baseline) · finding · 0.763
Directly prompting Gemma-2-27B to be honest had no effect on deceptive response rate