Probing Methods

Top-down interpretability approach that studies linguistic properties at successive stages of the residual stream; contrasted with the paper's bottom-up mechanistic approach.

Neighborhood — ranked by edge-count

Methods (1)

Diagnostic Probing · method · related_to
Earlier interpretability method that applies classifiers to DNN hidden representations; shares the complexity-accuracy dilemma with causal abstraction.
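
As a rough illustration of what applying a classifier to hidden representations involves, the sketch below trains a logistic-regression probe on synthetic vectors. Everything in it is an assumption made for illustration: the toy data generator, the function names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`), and the choice of a linear probe are not taken from the source. Real diagnostic probing would use activations extracted from a trained network rather than generated data.

```python
import math
import random


def make_synthetic_activations(n, dim, rng):
    """Toy stand-in for hidden states (hypothetical data, not real activations).

    A binary property is encoded by shifting each vector along a fixed direction.
    """
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        vec = [rng.gauss(0.0, 1.0) + sign * d for d in direction]
        data.append((vec, label))
    return data


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, dim, epochs=20, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            z = sum(wi * xi for wi, xi in zip(w, vec)) + b
            err = _sigmoid(z) - label  # gradient of the log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, vec)]
            b -= lr * err
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for vec, label in data:
        z = sum(wi * xi for wi, xi in zip(w, vec)) + b
        correct += int((1 if z > 0 else 0) == label)
    return correct / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    train = make_synthetic_activations(400, 8, rng)
    held_out = make_synthetic_activations(200, 8, rng)
    w, b = train_linear_probe(train, 8)
    print("held-out probe accuracy:", probe_accuracy(w, b, held_out))
```

High held-out accuracy in a setup like this only indicates that the property is linearly decodable from the vectors; it does not by itself show that a model uses that information.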

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probes · concept · 0.826
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Activation Probing · concept · 0.813
Technique for reading out model beliefs from internal activations before the final answer token is generated.
Sparse Probing · method · 0.809
Method from Gurnee et al. 2023 for finding feature directions, including individual-neuron analysis.
Unsupervised Probing · method · 0.808
Probing approach that avoids supervision to sidestep the complexity-accuracy tradeoff.
base model probing · method · 0.799
Method of using base models (no post-training) to observe spontaneous self-referential behaviors without the confound of memorized introspection language.
Linear Probing · method · 0.796
Used to evaluate representation quality across VTAB tasks.
Amnesic Probing · method · 0.772
Behavioral explanation technique using amnesic counterfactuals, from Elazar et al. 2020.
Marker Method · method · 0.767
Method for assessing consciousness in nonhuman animals by identifying behavioral/anatomical markers from humans and extrapolating; proposed adaptation for AI.