concept

active

concept:gemma-scope-open-sparse-autoencoders-everywhere-all-at-once-on-gemma-2-lieberum-et-al-2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 (Lieberum et al., 2024)

Paper introducing the Gemma Scope sparse autoencoders (SAEs), used for the Gemma 2 model experiments

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse autoencoders produce interpretable features for large models. · claim · 0.808
Central claim of the paper: the method scales to state-of-the-art transformers.
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.798
Critique of activation-based interpretability methods.
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.781
Core methodology paper for SAE-based interpretable feature extraction
Gemma 2: Improving Open Language Models at a Practical Size (Team et al., 2024) · concept · 0.774
Paper describing Gemma 2 model family used in this study
Sparse Autoencoder Features · concept · 0.770
Used in Anthropic's welfare assessment to identify co-activating features for performative behavior and hidden emotional struggle
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.765
Empirical principle discovered during autoencoder training; led to using 8 billion training points
Sparse Autoencoder · framework · 0.765
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.762
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights