finding

active

finding:gemma-3-1b-it-yields-only-one-valid-mds-injection-score-phi-1-a-up-4-8-and-is-excluded-from-main-analyses

gemma-3-1b-it yields only one valid MDS injection score (phi_1,A,up = 4.8) and is excluded from main analyses

An identified exception to the overall effectiveness of MDS injections; the cause remains unexplained and is reported as a limitation

Source paper

extracted_from

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Why do MDS injections fail on gemma-3-1b-it but succeed across all other tested LLMs? · question · 0.814
Unexplained exception identified as a limitation and open question
PM achieves overall SJT steerability Phi=9.6 on gemma-3-12b-it vs MDS=8.7 and P2=8.3 · finding · 0.781
Per-model steerability comparison from Table 4
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.767
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
Gemma-2-2B ASR drops from 100% at dims 1–2 to 43.1% at dim 4 and 27.1% at dim 5 · finding · 0.766
Small Gemma model shows severe ASR degradation at higher cone dimensions
MDS injections outperform P2 in open-ended generation in 11 of 14 LLMs with Phi gains of 3.61% to 16.44% · finding · 0.764
Primary quantitative result overturning prior reports that prompting outperforms representation engineering
Gemma 3 4B wellbeing probe: peak Cohen's d=1.8 · finding · 0.758
Weaker cross-family probe; explains weaker introspection in Gemma
Uncalibrated sweep units and restricted coefficient ranges are the primary cause of prior reports showing P2 outperforming MD injections · claim · 0.757
Mechanistic explanation for discrepancy with Banayeeanzade et al.; addressed by centroid unit and unbounded sweep contributions
Gemma-3-4B-it shows three-stage layer trajectory and S(ℓ) peak despite scale differences in dr and ρd · finding · 0.750
E3 backbone generalization finding for Gemma; validates pattern across diverse architectures