Diagnostic Probing

Earlier interpretability method that applies classifiers to DNN hidden representations; shares the complexity-accuracy dilemma with causal abstraction

Neighborhood — ranked by edge-count

framework

Distributed Alignment Search (DAS)
analogous_to
Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent

method

Probing Methods
related_to
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probes · concept · 0.832
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Activation Probing · concept · 0.822
Technique of reading out model beliefs from internal activations before the final answer token is generated
Diagnosis · method · 0.808
The method of examining a neighborhood meter by meter to identify healthy and damaged places as the basis for ongoing repair.
Linear Probing · method · 0.788
Used to evaluate representation quality across VTAB tasks
Amnesic Probing · method · 0.783
Behavioral explanation technique using amnesic counterfactuals by Elazar et al. 2020
Unsupervised Probing · method · 0.782
Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
Sparse Probing · method · 0.776
Method from Gurnee et al. 2023 for finding feature directions including individual neuron analysis
base model probing · method · 0.769
Method of using base models (no post-training) to observe spontaneous self-referential behaviors without the confound of memorized introspection language.
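
The selection rule behind this list (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `cosine` and `untyped_neighbors`, the toy three-dimensional vectors, and the idea that each entity has a single embedding are all hypothetical and not taken from the page; only the 0.65 threshold and the "no typed edge" condition come from the text above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no
    typed edge to the target, sorted by descending similarity."""
    scored = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy embeddings; real entity vectors would be much larger.
embeddings = {
    "Diagnostic Probing": [1.0, 0.0, 0.0],
    "Probes": [0.9, 0.1, 0.0],
    "Probing Methods": [0.8, 0.2, 0.0],
    "Unrelated": [0.0, 1.0, 0.0],
}
typed_edges = {"Probing Methods"}  # already linked via related_to

candidates = untyped_neighbors("Diagnostic Probing", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

In this toy setup only "Probes" survives the filter: "Probing Methods" is excluded because it already has a typed edge, and "Unrelated" falls below the threshold.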