Sparse Autoencoder-based Framework for Steering Semantic Features

The main framework proposed for retrieving and steering high-order semantic features in LLMs via sparse autoencoders.

Neighborhood — ranked by edge-count

paper

method

Sparse Autoencoders (SAE)
implements
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Contrastive Feature Retrieval Pipeline
implements
A pipeline employing controlled semantic oppositions to distill monosemantic functional features from sparse activation spaces.
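
One way to read the contrastive retrieval step is as a ranking problem: run two prompt sets that differ only in the target semantic property through the SAE, then rank features by how much more strongly they fire on one set than on the other. The sketch below is a minimal illustration under that assumption, not the pipeline's actual procedure. The function names, the mean-difference score, and the toy activation values are all hypothetical.

```python
def mean_per_feature(rows):
    """Average each feature (column) over a list of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def rank_contrastive_features(pos_acts, neg_acts, top_k=3):
    """Rank SAE features by mean activation on the positive set minus the negative set.

    Assumption: mean difference is used as the contrast score. The source does
    not specify the scoring rule, so this is illustrative only.
    """
    pos_mean = mean_per_feature(pos_acts)
    neg_mean = mean_per_feature(neg_acts)
    diffs = [p - n for p, n in zip(pos_mean, neg_mean)]
    order = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return [(i, diffs[i]) for i in order[:top_k]]

# Hypothetical SAE activations: rows are prompts, columns are features.
pos = [[0.0, 2.0, 0.1, 0.0], [0.0, 1.8, 0.0, 0.3]]  # prompts with the property
neg = [[0.0, 0.1, 0.1, 0.4], [0.0, 0.0, 0.0, 0.5]]  # matched prompts without it
top = rank_contrastive_features(pos, neg, top_k=2)
```

In this toy data feature 1 fires almost only on the positive set, so it ranks first. A real pipeline would likely also filter out features that fire on unrelated inputs before treating a candidate as monosemantic.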

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse Autoencoder Features · concept · 0.837
Used in Anthropic's welfare assessment to identify co-activating features for performative behavior and hidden emotional struggle
Sparse Autoencoder · framework · 0.830
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.808
Core methodology paper for SAE-based interpretable feature extraction
Sparse autoencoders produce interpretable features for large models · claim · 0.803
Central claim of the paper: the method scales to state-of-the-art transformers.
Sparse Autoencoder for Dictionary Learning · framework · 0.796
Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.789
Empirical principle discovered during autoencoder training; led to using 8 billion training points
Sparse Autoencoders (SAE) activation-based paradigm · framework · 0.780
Standard interpretability approach that VPD critiques and proposes an alternative to.
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.771
Critique of activation-based interpretability methods.
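
The "Sparse Autoencoder for Dictionary Learning" entry above describes a one-hidden-layer MLP with an L1 sparsity penalty and an overcomplete dictionary. As a rough illustration of what that objective looks like, the sketch below computes the forward pass and loss for a single activation vector. It assumes a ReLU encoder, a squared-error reconstruction term, and toy dimensions; the names (`sae_forward`, `sae_loss`, `l1_coeff`) are hypothetical, and no training loop is shown.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation vector into feature activations, then reconstruct it."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    f = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on the feature activations."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in f)
    return recon + l1_coeff * sparsity

# Toy sizes (assumed): 4-dimensional activations, 16 dictionary features,
# so the dictionary is overcomplete (more features than input dimensions).
d_model, d_dict = 4, 16
rng = random.Random(0)
W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
W_dec = [[rng.gauss(0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
b_enc = [0.0] * d_dict
b_dec = [0.0] * d_model

x = [1.0, -0.5, 0.25, 2.0]  # stand-in for one model activation vector
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
```

The L1 term pushes most entries of `f` toward zero during training, which is what makes each surviving feature a candidate for a single interpretable direction.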