concept

active

concept:towards-monosemanticity-decomposing-language-models-with-dictionary-learning-bricken-et-al-2023

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (Bricken et al., 2023)

Foundational paper on sparse autoencoders (SAEs) for mechanistic interpretability

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
cites

Venues (1)

venue

Transformer Circuits Thread
cites
Anthropic's mechanistic interpretability research blog where this paper was published.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

monosemanticity · concept · 0.803
Interpretability property where a latent feature represents a single semantic concept; benchmarked across architectures.
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.799
Motivation for using sparsity-based dictionary learning on language models
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024) · concept · 0.794
Key paper on scaling SAE-based interpretability to frontier models, cited as precedent
Dictionary learning offers advantages over linear probes: amortization of cost and unsupervised discovery of abstractions. · claim · 0.779
SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
There are fewer representations competent for N tasks than M < N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.778
Selective pressure toward convergence via task generality
Language models are few-shot learners (Brown et al., 2020) · concept · 0.775
Demonstrated transformers on mathematical understanding and logic; cited to motivate transformer versatility.
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.774
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (Li et al., 2023) · concept · 0.772
Safety intervention that relies on activation modification, which ESR might undermine