concept
active
concept:sparse-feature-circuits-discovering-and-editing-interpretable-causal-graphs-in-language-models-marks-et-al-2025
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (Marks et al., 2025)
Cited as enabling precise behavioral control through SAE features, extending the same methodological line
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.830 · Core methodology paper for SAE-based interpretable feature extraction
- Central claim of the paper: the method scales to state-of-the-art transformers.
- Methodological claim about the scientific value of combining causal abstraction with representational geometry analysis
- Normative vision for how the circuits agenda could resolve the pre-paradigmatic state of interpretability
- Motivation for using sparsity-based dictionary learning on language models
- Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study
- Authors' interpretation connecting their proof to practical interpretability methodology
- Third of three speculative claims asserting that learned features are not model-specific but represent universal solutions to learning problems