Sparse interpretability in neural networks

VPD achieves sparse, interpretable parameter subcomponents with an improved sparsity-reconstruction tradeoff.

Neighborhood — ranked by edge-count

Papers (1)

paper

Interpreting Language Model Parameters
mentions

Concepts (1)

concept

Neural Network Interpretability
related_to
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Dictionary Learning for Neural Network Interpretability · method · 0.797
Bricken et al.'s method for decomposing language models into interpretable features; cited as AI-alignment interpretability work relevant to consciousness detection
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.774
Core methodology paper for SAE-based interpretable feature extraction
Sparse autoencoders produce interpretable features for large models. · claim · 0.773
Central claim of the paper: the method scales to state-of-the-art transformers.
Can an interpretable symbolic algorithm be used to faithfully explain a complex neural network model? · question · 0.765
Framing question for the paper's research program.
Noisy Simulation of Sparse Networks · concept · 0.763
Mechanism by which superposition works: small neural networks exploit sparsity to approximately simulate much larger sparse networks
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.762
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (Marks et al., 2025) · concept · 0.761
Cited as enabling precise behavioral control through SAE features, extending the same methodological line
Sparse and smooth coding · concept · 0.759
Coding scheme where qualities are represented by few neurons with continuous similarity relations.
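
The selection rule behind this list (cosine ≥ 0.65, no typed edge to the current entity) could be sketched roughly as follows. This is an illustrative sketch only: the page does not say how embeddings are computed or how the graph is stored, so the function names, the data layout, and the toy vectors below are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs similar to `target` but not linked to it.

    `embeddings` maps entity name -> vector; `typed_edges` is a list of
    (source, destination) pairs. Both layouts are hypothetical.
    """
    # Entities that already share a typed edge with the target, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))

    # Highest similarity first, matching the ordering of the list above.
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits


# Toy data, made up for illustration.
embeddings = {
    "target": [1.0, 0.0],
    "near_unlinked": [0.9, 0.1],
    "near_linked": [1.0, 0.1],
    "far": [0.0, 1.0],
}
typed_edges = [("target", "near_linked")]

candidates = related_by_similarity("target", embeddings, typed_edges)
```

In this toy graph only `near_unlinked` is returned: `near_linked` is excluded because it already has a typed edge, and `far` falls below the threshold. Each surviving entry is then a candidate for a new edge or a possible duplicate, as the section description says.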