concept

active

concept:sparse-feature-circuits-discovering-and-editing-interpretable-causal-graphs-in-language-models-marks-et-al-2025

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (Marks et al., 2025)

Cited as enabling precise behavioral control through SAE features, extending the same methodological line

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.830
Core methodology paper for SAE-based interpretable feature extraction
Sparse autoencoders produce interpretable features for large models. · claim · 0.807
Central claim of the paper: the method scales to state-of-the-art transformers.
An interplay between causal abstraction and feature geometry deepens mechanistic understanding of language models · claim · 0.800
Methodological claim about the scientific value of combining causal abstraction with representational geometry analysis
Circuits could act as an epistemic foundation for interpretability by breaking down model behavior into falsifiable statements about small subgraphs. · claim · 0.795
Normative vision for how the circuits agenda could resolve the pre-paradigmatic state of interpretability
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.790
Motivation for using sparsity-based dictionary learning on language models
Features are connected by weights forming circuits, and these circuits can be rigorously studied and understood as meaningful algorithms. · claim · 0.788
Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study
Causal abstraction implicitly relies on strong assumptions about feature encoding in DNNs, and becomes trivial without such assumptions · claim · 0.784
Authors' interpretation connecting their proof to practical interpretability methodology
Analogous features and circuits form across models and tasks. · claim · 0.775
Third of three speculative claims asserting that learned features are not model-specific but represent universal solutions to learning problems