Gemma-3-4B-it

Backbone model used in E3 robustness overlay.

Neighborhood (ranked by edge count)

method

E3: Layer-wise Geometric Trajectory Analysis
uses
Quantitative study correlating layer-wise anchoring geometry (S_max, AUS_N) with behavioral thresholds θ₅₀

concept

Gemma-2-2B-it
related_to
Smallest Gemma model tested, showing near-zero ESR
Gemma-2-9B-it
related_to
Mid-sized Gemma model tested, showing near-zero ESR
gemma-3-1b-it
related_to
Only model where MDS injections largely failed; excluded from main analyses
gemma-3-12b-it
related_to
12B Gemma model tested; used for openness linearity visualization (Figure 6)
gemma-3-27b-it
related_to
27B Gemma model quantized to 4-bit NF4; tested in OCEAN benchmarks

finding

Math and code tasks show the strongest mid-layer anchoring on LLaMA (S ≈ −1.65 at layers 8–12)
associated_with
Task-specific E3 finding showing that compositional reasoning requires deeper processing

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
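The selection rule above (cosine ≥ 0.65 and no typed edge) can be sketched as follows. This is a minimal illustration, not the graph tool's actual implementation: the helper name `untyped_neighbors`, the toy embedding vectors, and the edge-set representation are hypothetical assumptions. Only the 0.65 threshold and the "no typed edge" condition come from this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: return (entity, cosine) pairs whose similarity to
    `target` is >= threshold and that share no typed edge with it, sorted by
    descending cosine (rounded to 3 decimals, as displayed on the page)."""
    # Entities already linked to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}
    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, round(score, 3)))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates

if __name__ == "__main__":
    # Toy 2-D embeddings (illustrative only, not real entity vectors).
    embeddings = {
        "Gemma-3-4B-it": [1.0, 0.0],
        "A": [0.8, 0.6],   # cosine 0.8, untyped -> candidate
        "B": [0.0, 1.0],   # cosine 0.0 -> below threshold
        "C": [0.6, 0.8],   # cosine 0.6 -> below threshold
        "D": [0.9, 0.1],   # high cosine, but already has a typed edge
        "E": [0.7, 0.7],   # cosine ~0.707, untyped -> candidate
    }
    typed_edges = {("Gemma-3-4B-it", "D")}
    print(untyped_neighbors("Gemma-3-4B-it", embeddings, typed_edges))
    # prints [('A', 0.8), ('E', 0.707)]
```

Entities that already have a typed edge are excluded before scoring, so a high-similarity but already-linked entity (like `D` above) never appears as a candidate.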

Gemma-3-4B-it shows a three-stage layer trajectory and an S(ℓ) peak despite scale differences in dr and ρd · finding · cosine 0.757
E3 backbone generalization finding for Gemma; validates the pattern across diverse architectures
Gemma 3 4B-IT wellbeing introspection: ρ=0.28, isotonic R²=0.11 (LMM p=1.33×10⁻¹³) · finding · cosine 0.745
Weaker but still significant introspective coupling in Gemma model; consistent with lower probe quality
Gemma 2: Improving Open Language Models at a Practical Size (Team et al., 2024) · concept · cosine 0.743
Paper describing Gemma 2 model family used in this study
Gemma 3 4B wellbeing probe: peak Cohen's d=1.8 · finding · cosine 0.736
Weaker cross-family probe; explains the weaker introspection in Gemma
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · cosine 0.727
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
Gemma-2-2B ASR drops from 100% at dims 1–2 to 43.1% at dim 4 and 27.1% at dim 5 · finding · cosine 0.725
Small Gemma model shows severe ASR degradation at higher cone dimensions
GemmaScope SAEs · concept · cosine 0.724
SAEs trained on pretrained Gemma-2 models used for steering in Gemma family experiments
Gemma-2-27B Perspectives accuracy remains 100% after SOO fine-tuning · finding · cosine 0.722
SOO fine-tuning did not collapse Gemma-2-27B self-other distinction needed for perspective-taking