Probes

Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.

Neighborhood — ranked by edge-count

paper

method

Probe-Based Data Attribution · cites
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
Linear Probe · implements
Simple linear classifiers trained on model activations used as the probing technique within the introduced method.
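As a rough illustration of the two entries above, the sketch below fits a logistic-regression probe on activation vectors and then ranks datapoints by probe score. Everything in it is an assumption made for illustration: the activations are synthetic, the function names `train_linear_probe` and `probe_predict` are hypothetical, and ranking by probe logit is just one plausible way to surface candidate datapoints, not necessarily the procedure used by the method described here.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels              # gradient of the log loss w.r.t. logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_predict(acts, w, b):
    """Hard 0/1 predictions from the linear probe."""
    return (acts @ w + b > 0).astype(int)

# Synthetic stand-in for model activations (assumption): the label shifts
# each activation along one hidden direction.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=400)
acts = rng.normal(size=(400, d)) + 3.0 * np.outer(2 * labels - 1, direction)

w, b = train_linear_probe(acts[:300], labels[:300])
accuracy = (probe_predict(acts[300:], w, b) == labels[300:]).mean()

# Illustrative attribution step (assumption): rank datapoints by probe logit,
# highest first, to get candidates most associated with the probed property.
ranked = np.argsort(-(acts @ w + b))
```

On real data, `acts` would hold activations extracted from a chosen layer of the model, and `labels` would mark whether each example exhibits the behavior of interest.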

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Diagnostic Probing · method · 0.832
Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction.
Probing Methods · method · 0.826
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach.
Probe score · concept · 0.820
Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state.
Activation Probing · concept · 0.799
Technique of reading out model beliefs from internal activations before the final answer token is generated.
Sparse Probing · method · 0.794
Method from Gurnee et al. 2023 for finding feature directions including individual neuron analysis.
Unsupervised Probing · method · 0.793
Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff.
Probe Generalization · concept · 0.792
The ability of probes trained on one dataset to transfer accurately to topically and structurally different datasets.
Focus probe (distracted vs. focused) · concept · 0.768
One of four emotive concept probes trained; contrastive pair distracted/focused with best layer 10 in LLaMA-3.2-3B.
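The Probe score entry in this list describes a layer-window average of dot products. A minimal sketch of that computation follows, under several assumptions not stated in the source: one concept vector per layer, a window centered on the best layer, and clamping at the first and last layers. The function name `probe_score` and the toy inputs are hypothetical.

```python
import numpy as np

def probe_score(hidden_states, concept_vectors, best_layer, window=5):
    """Average per-layer dot products over a window centered on best_layer.

    hidden_states, concept_vectors: arrays of shape (num_layers, d).
    The window is clamped to valid layer indices (an assumption).
    """
    half = window // 2
    lo = max(0, best_layer - half)
    hi = min(hidden_states.shape[0], best_layer + half + 1)
    # Row-wise dot product: one scalar per layer in the window.
    per_layer = np.einsum("ld,ld->l", hidden_states[lo:hi], concept_vectors[lo:hi])
    return float(per_layer.mean())

# Toy check: layer l has hidden state l * ones(4) and concept vector ones(4) / 4,
# so each per-layer dot product equals the layer index.
num_layers, d = 28, 4
hidden = np.arange(num_layers)[:, None] * np.ones((num_layers, d))
concepts = np.full((num_layers, d), 0.25)
score = probe_score(hidden, concepts, best_layer=10)  # averages layers 8..12
```

If a single concept vector is shared across layers instead, it can be broadcast to shape `(num_layers, d)` before the call.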