Sparse Probing

Method from Gurnee et al. (2023) for finding feature directions by training sparse linear probes on internal activations, including analysis of individual neurons.
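A minimal sketch of the idea, assuming a binary feature and a matrix of neuron activations: pick the k neurons whose mean activation differs most between the two classes, then fit a linear probe on those neurons only. The function name `k_sparse_probe`, the mean-difference selection heuristic, the least-squares fit, and the synthetic data are all illustrative assumptions, not necessarily the procedure used by Gurnee et al.

```python
import numpy as np

def k_sparse_probe(acts, labels, k):
    """Hypothetical helper: select k neurons, then fit a linear probe on them."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)
    # Score each neuron by the absolute difference in class-mean activation.
    diff = np.abs(acts[labels].mean(axis=0) - acts[~labels].mean(axis=0))
    top = np.argsort(diff)[::-1][:k]
    # Least-squares linear probe (with bias term) on the selected neurons only.
    X = np.column_stack([acts[:, top], np.ones(len(acts))])
    y = np.where(labels, 1.0, -1.0)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Training accuracy of the sign of the probe output.
    acc = float(((X @ w > 0) == labels).mean())
    return top, w, acc

# Synthetic activations (assumed data): neuron 7 carries the feature.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
acts = rng.normal(size=(400, 50))
acts[:, 7] += 3.0 * labels

top, w, acc = k_sparse_probe(acts, labels, k=1)
print("selected neurons:", top, "train accuracy:", acc)
```

With k = 1 the probe reduces to individual-neuron analysis; increasing k shows how much accuracy is gained by allowing the feature to be spread over more neurons.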

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probing Methods · method · 0.809
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach.
Sparse and smooth coding · concept · 0.803
Coding scheme where qualities are represented by few neurons with continuous similarity relations.
Probes · concept · 0.794
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Unsupervised Probing · method · 0.792
Probing approach that avoids supervision to sidestep the complexity-accuracy tradeoff.
Activation Probing · concept · 0.782
Technique of reading out model beliefs from internal activations before the final answer token is generated.
Diagnostic Probing · method · 0.776
Earlier interpretability method that applies classifiers to DNN hidden representations; shares the complexity-accuracy dilemma with causal abstraction.
Sparse Feature Dictionary · concept · 0.766
The set of sparse, interpretable features extracted from model embeddings via SAEs.
Sparse circuits · concept · 0.754
A goal in mechanistic interpretability to identify sparse computational subgraphs; VPD promotes sparse parameter circuits.