GemmaScope SAEs

Sparse autoencoders (SAEs) trained on pretrained Gemma-2 models and used for steering in Gemma-family experiments

Neighborhood — ranked by edge-count

thinker

Tom Lieberum
studies
Lead author of the GemmaScope paper, which provides the SAEs used for Gemma-2 models

dataset

Gemma-2-27B-it
associated_with
27B parameter LLM used in SOO fine-tuning experiments

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Gemma-3-4B-it · concept · 0.724
Backbone model used in E3 robustness overlay.
Gemma-2-9B-it · concept · 0.720
Medium Gemma model tested, showing near-zero ESR
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 (Lieberum et al., 2024) · concept · 0.714
Paper introducing GemmaScope SAEs used for Gemma-2 model experiments
gemma-3-1b-it · concept · 0.714
Only model where MDS injections largely failed; excluded from main analyses
Gemma-2-2B-it · concept · 0.711
Smallest Gemma model tested, showing near-zero ESR
Patchscopes · framework · 0.700
Unifying framework for inspecting hidden representations of language models via representation interventions
gemma-3-12b-it · concept · 0.700
12B Gemma model tested; used for openness linearity visualization (Figure 6)
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation · claim · 0.684
Extension of mechanistic interpretability findings to the metacognitive domain
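
The "cosine ≥ 0.65 · no typed edge" listing above can be read as a simple filter: keep entities whose embedding is close to this one and that have no typed relation to it. The following is a minimal sketch of that idea, not the system's actual implementation. The function names, the toy 2-D embeddings, and the edge list are all hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that share no typed edge with target."""
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the order of the listing.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: real embeddings would be high-dimensional.
embeddings = {
    "GemmaScope SAEs": [1.0, 0.0],
    "Gemma-2-9B-it": [0.8, 0.6],
    "Tom Lieberum": [0.9, 0.1],
    "Unrelated entity": [0.0, 1.0],
}
typed_edges = [("Tom Lieberum", "GemmaScope SAEs")]

candidates = untyped_neighbors("GemmaScope SAEs", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

In this toy setup, "Tom Lieberum" is skipped because it already has a typed edge, and "Unrelated entity" falls below the threshold, so only entities that are similar but unlinked remain as candidates for new edges or duplicates.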