Unsupervised Probing

A probing approach that avoids supervision in order to sidestep the complexity-accuracy tradeoff.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Unsupervised Learning · method · 0.822
Learning that builds a low-dimensional model of input data without error signals or rewards; Hebbian learning is an example.
Probing Methods · method · 0.808
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach.
Probes · concept · 0.793
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Sparse Probing · method · 0.792
Method from Gurnee et al. 2023 for finding feature directions, including individual neuron analysis.
Unsupervised Behavior Clustering · method · 0.784
Method that clusters behaviors without prior labels, used to surface concerning learned patterns.
Diagnostic Probing · method · 0.782
Earlier interpretability method applying classifiers to DNN hidden representations; shares the complexity-accuracy dilemma with causal abstraction.
Activation Probing · concept · 0.781
Technique of reading out model beliefs from internal activations before the final answer token is generated.
base model probing · method · 0.768
Method of using base models (no post-training) to observe spontaneous self-referential behaviors without the confound of memorized introspection language.
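
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the toy embeddings, and the representation of typed edges as unordered pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities near `target` with no typed edge.

    `embeddings` maps entity name -> vector (hypothetical data layout).
    `typed_edges` is a set of frozenset({a, b}) pairs that already have a
    typed relation (also a hypothetical layout).
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation to the target.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy data, invented purely for illustration.
toy_embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.1],  # close to A, no typed edge: kept
    "C": [0.0, 1.0],  # orthogonal to A: below threshold
    "D": [1.0, 0.2],  # close to A, but already has a typed edge: skipped
}
toy_typed_edges = {frozenset(("A", "D"))}
candidates = related_by_similarity("A", toy_embeddings, toy_typed_edges)
```

Entities returned this way are the candidates for new edges or unrecognized duplicates; anything that already has a typed relation is excluded regardless of its score.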