Amnesic Probing

Behavioral explanation technique that uses amnesic counterfactuals (Elazar et al., 2020)

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Diagnostic Probing · method · 0.783
Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
Probing Methods · method · 0.772
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
Activation Probing · concept · 0.767
Technique of reading out model beliefs from internal activations before the final answer token is generated
Probes · concept · 0.763
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution
Unsupervised Probing · method · 0.763
Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
Linear Probing · method · 0.751
Used to evaluate representation quality across VTAB tasks
Sparse Probing · method · 0.726
Method from Gurnee et al. 2023 for finding feature directions including individual neuron analysis
base model probing · method · 0.720
Method of using base models (no post-training) to observe spontaneous self-referential behaviors without the confound of memorized introspection language