concept
active
concept:gemma-scope-open-sparse-autoencoders-everywhere-all-at-once-on-gemma-2-lieberum-et-al-2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 (Lieberum et al., 2024)
Paper introducing the Gemma Scope sparse autoencoders (SAEs), used for experiments on Gemma 2 models
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Critique of activation-based interpretability methods.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.781 · Core methodology paper for SAE-based interpretable feature extraction
- Paper describing Gemma 2 model family used in this study
- Used in Anthropic welfare assessment to identify performative behavior and hidden emotional struggle co-activating features
- Empirical principle discovered during autoencoder training; led to using 8 billion training points
- Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
- Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights