gemma-3-1b-it

Only model where MDS injections largely failed; excluded from main analyses

Neighborhood — ranked by edge-count

Gemma-2-2B-it
related_to
Smallest Gemma model tested, showing near-zero ESR
Gemma-2-9B-it
related_to
Medium Gemma model tested, showing near-zero ESR
Gemma-3-4B-it
related_to
Backbone model used in E3 robustness overlay.
gemma-3-12b-it
related_to
12B Gemma model tested; used for openness linearity visualization (Figure 6)
gemma-3-27b-it
related_to
27B Gemma model quantized to 4-bit NF4; tested in OCEAN benchmarks

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Gemma-3-4B-it shows three-stage layer trajectory and S(ℓ) peak despite scale differences in dr and ρd · finding · 0.723
E3 backbone generalization finding for Gemma; validates pattern across diverse architectures
GemmaScope SAEs · concept · 0.714
SAEs trained on pretrained Gemma-2 models used for steering in Gemma family experiments
Gemma 2: Improving Open Language Models at a Practical Size (Team et al., 2024) · concept · 0.706
Paper describing Gemma 2 model family used in this study
Gemma 3 4B-IT wellbeing introspection: ρ=0.28, isotonic R²=0.11 (LMM p=1.33×10⁻¹³) · finding · 0.701
Weaker but still significant introspective coupling in Gemma model; consistent with lower probe quality
gemma-3-1b-it yields only one valid MDS injection score (phi_1,A,up = 4.8) and is excluded from main analyses · finding · 0.697
Identified exception to overall MDS effectiveness; the cause is unexplained and is noted as a limitation
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.695
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
Gemma 3 4B wellbeing probe: peak Cohen's d=1.8 · finding · 0.695
Weaker cross-family probe; explains weaker introspection in Gemma
Gemma 2 27B is unlikely to take on human personas when steered away from Assistant, preferring nonhuman or theatrical portrayals · finding · 0.688
Model-specific difference in persona susceptibility