Gemma 3 4B wellbeing probe: peak Cohen's d=1.8

Weaker cross-family probe; explains weaker introspection in Gemma
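
For readers unfamiliar with the metric, the following is a minimal sketch of how a peak Cohen's d could be computed from per-layer probe scores. It assumes the pooled-standard-deviation form of Cohen's d and a hypothetical input layout (`scores_by_layer`, mapping each layer to probe scores for the two contrasted conditions); the source does not state which variant or data layout the authors used.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (one common convention)."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def peak_layer(scores_by_layer):
    """Return (layer, d) for the layer with the largest |d|.

    scores_by_layer is a hypothetical structure:
    {layer_index: (scores_condition_a, scores_condition_b)}.
    """
    d_by_layer = {
        layer: cohens_d(a, b) for layer, (a, b) in scores_by_layer.items()
    }
    best = max(d_by_layer, key=lambda layer: abs(d_by_layer[layer]))
    return best, d_by_layer[best]


# Toy data for illustration only; these are not values from the paper.
toy_scores = {
    1: ([2, 4, 6], [1, 3, 5]),
    2: ([3, 5, 7], [1, 3, 5]),
}
layer, d = peak_layer(toy_scores)
```

Under this convention, d=1.8 means the two conditions' mean probe scores differ by 1.8 pooled standard deviations at the best layer.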

Source paper

extracted_from

(2026) · Nicolas Martorell · Bruno Bianchi

finding

Gemma 3 4B-IT wellbeing introspection: ρ=0.28, isotonic R²=0.11 (LMM p=1.33×10⁻¹³)
supports
Weaker but still significant introspective coupling in Gemma model; consistent with lower probe quality

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
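
As a rough illustration of the selection rule described above, the sketch below lists entities whose embedding cosine similarity to a target is at least 0.65 and that have no typed edge to it. The function names, the `embeddings` and `typed_edges` structures, and the treatment of edges as a per-entity set are all assumptions for illustration; the source does not describe how the graph actually stores or queries these.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it.

    embeddings: {entity_name: vector} (hypothetical layout)
    typed_edges: {entity_name: set of entity names it is linked to}
    """
    linked = typed_edges.get(target, set())
    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            candidates.append((name, sim))
    # Highest similarity first, matching the ordering of the list below.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only.
toy_embeddings = {"a": [1, 0], "b": [1, 0.1], "c": [0, 1], "d": [1, 0.2]}
toy_edges = {"a": {"d"}}
neighbors = untyped_neighbors("a", toy_embeddings, toy_edges)
```

In this toy case, `d` is similar to `a` but is skipped because it already has a typed edge, and `c` falls below the threshold.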

Qwen 2.5 7B wellbeing probe: peak Cohen's d=3.5 · finding · 0.887
Strongest cross-family probe; explains clearer introspection in Qwen than Gemma
Wellbeing probe: peak Cohen's d=3.34 (layer 16), p=7.21×10⁻¹³ in LLaMA-3.2-3B · finding · 0.880
Probe validation result confirming wellbeing direction captures meaningful structure
Impulsivity probe: peak Cohen's d=3.60 (layer 13), p=3.58×10⁻¹³ in LLaMA-3.2-3B · finding · 0.817
Strongest probe validation result; highest Cohen's d among the four concepts
Interest probe: peak Cohen's d=1.67 (layer 14), p=9.45×10⁻⁶ in LLaMA-3.2-3B · finding · 0.815
Probe validation result confirming interest direction captures meaningful structure
Wellbeing probe drift is positive in Gemma (ρ=0.34 pooled turn-correlation) and Qwen (ρ=0.24); both p<10⁻⁵ · finding · 0.797
Normalized probe-score drift across turns generalizes beyond LLaMA family
Gemma-3-4B-it shows three-stage layer trajectory and S(ℓ) peak despite scale differences in dr and ρd · finding · 0.772
E3 backbone generalization finding for Gemma; validates pattern across diverse architectures
gemma-3-1b-it yields only one valid MDS injection score (phi_1,A,up = 4.8) and is excluded from main analyses · finding · 0.758
Identified exception to overall MDS effectiveness; reason remains unexplained as a limitation
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.753
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B