framework
active
framework:sparse-autoencoder-based-framework-for-steering-semantic-features
Sparse Autoencoder-based Framework for Steering Semantic Features
The main framework proposed for retrieving and steering high-order semantic features in LLMs via sparse autoencoders.
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (2)
method
- Sparse Autoencoders (SAE) · implements · Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Contrastive Feature Retrieval Pipeline · implements · A pipeline employing controlled semantic oppositions to distill monosemantic functional features from sparse activation spaces.
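As a rough illustration of the contrastive retrieval idea in the entry above, the sketch below ranks sparse features by how much more strongly they fire on one side of a controlled semantic opposition than on the other. The scoring rule (difference of mean activations) and the names `mean_activation` and `rank_contrastive_features` are assumptions made for this example; the source does not specify how the pipeline scores features.

```python
def mean_activation(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def rank_contrastive_features(pos_acts, neg_acts, top_k=3):
    """Rank feature indices by mean(pos) - mean(neg), largest first.

    pos_acts / neg_acts: SAE feature activations (one vector per prompt)
    collected on the two sides of a semantic opposition.
    NOTE: the difference-of-means score is a hypothetical choice for
    illustration, not the scoring rule of the framework described here.
    """
    pos_mean = mean_activation(pos_acts)
    neg_mean = mean_activation(neg_acts)
    scores = [p - n for p, n in zip(pos_mean, neg_mean)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy activations over 4 features: feature 1 fires only on the
# "positive" prompts, feature 2 only on the "negative" ones.
pos = [[0.0, 2.0, 0.0, 1.0], [0.0, 3.0, 0.0, 1.0]]
neg = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 2.0, 1.0]]
top = rank_contrastive_features(pos, neg, top_k=2)
```

In this toy setup the top-ranked index is the feature that separates the two prompt sets most cleanly; features that fire equally on both sides (such as feature 3) receive a score of zero.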
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Used in Anthropic welfare assessment to identify performative behavior and hidden emotional struggle co-activating features
- Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.808 · Core methodology paper for SAE-based interpretable feature extraction
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Primary method introduced: trains a one-hidden-layer MLP with L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
- Empirical principle discovered during autoencoder training; led to using 8 billion training points
- Standard interpretability approach that VPD critiques and proposes an alternative to.
- Critique of activation-based interpretability methods.
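The sparse autoencoder described in the list above (a one-hidden-layer network with an L1 sparsity penalty that decomposes activations into an overcomplete feature dictionary) can be sketched in a few lines. This is a minimal, dependency-free illustration only: the class name `TinySAE`, the untied encoder and decoder weights, the initialization scale, and the `l1_coeff` default are all assumptions, and no training loop is shown.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class TinySAE:
    """Hypothetical one-hidden-layer sparse autoencoder (forward pass and loss only)."""

    def __init__(self, d_model, d_dict, seed=0):
        rng = random.Random(seed)
        # Overcomplete dictionary: d_dict is meant to exceed d_model.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
        self.b_enc = [0.0] * d_dict
        # Untied decoder weights are an assumption of this sketch.
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """ReLU(W_enc x + b_enc): non-negative feature activations."""
        pre = matvec(self.W_enc, x)
        return [max(p + b, 0.0) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        """Reconstruct the activation as W_dec f + b_dec."""
        return [r + b for r, b in zip(matvec(self.W_dec, f), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Mean squared reconstruction error plus L1 penalty on the features."""
        f = self.encode(x)
        x_hat = self.decode(f)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        l1 = sum(abs(v) for v in f)
        return mse + l1_coeff * l1


sae = TinySAE(d_model=4, d_dict=16)
activation = [1.0, 0.0, -1.0, 0.5]
features = sae.encode(activation)
reconstruction = sae.decode(features)
```

Raising `l1_coeff` increases the cost of every active feature, which is what pushes a trained autoencoder toward sparse codes; the value to use in practice depends on the model and layer and is not given in the source.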