concept
active
concept:towards-monosemanticity-decomposing-language-models-with-dictionary-learning-bricken-et-al-2023
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023)
Foundational paper on sparse autoencoder (SAE) based mechanistic interpretability
Neighborhood — ranked by edge-count
Papers (1)
paper
Venues (1)
venue
- Anthropic's mechanistic interpretability research blog where this paper was published.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
- Motivation for using sparsity-based dictionary learning on language models
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024) · concept · 0.794 · Key paper on scaling SAE-based interpretability to frontier models, cited as precedent
- SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
- Selective pressure toward convergence via task generality
- Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
- RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.772 · Safety intervention that relies on activation modification, which ESR might undermine