base model probing

Method that uses base models (no post-training) to observe spontaneous self-referential behaviors without the confound of memorized introspection language.

Neighborhood — ranked by edge-count

Concepts (1)

in-context learning (ICL) · concept · implements
Test-time adaptation from prompt or retrieved context with no parameter updates.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probing Methods · method · 0.799
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
Diagnostic Probing · method · 0.769
Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
Unsupervised Probing · method · 0.768
Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
Probes · concept · 0.748
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Sparse Probing · method · 0.748
Method from Gurnee et al. 2023 for finding feature directions including individual neuron analysis
Activation Probing · concept · 0.744
Technique of reading out model beliefs from internal activations before the final answer token is generated
Logistic Regression Probe · method · 0.743
Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.741
Key methodological claim: mass-mean (MM) probes are both competitive in accuracy and superior in causal influence
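
The difference-in-means (mass-mean) probe named in the last entry can be illustrated with a short sketch. This is a minimal version under stated assumptions, not the cited work's implementation: activations are assumed to be already extracted into a NumPy array, the data below is synthetic, and the names `fit_mass_mean_probe`, `predict`, and `true_dir` are hypothetical. The probe direction is simply the mean activation of the positive class minus the mean activation of the negative class, with the decision threshold placed at the midpoint between the two class means.

```python
import numpy as np


def fit_mass_mean_probe(acts, labels):
    """Fit a difference-in-means probe; returns (direction, bias)."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)
    mu_pos = acts[labels].mean(axis=0)   # mean activation, positive class
    mu_neg = acts[~labels].mean(axis=0)  # mean activation, negative class
    direction = mu_pos - mu_neg
    # Place the decision boundary at the midpoint between the class means.
    bias = -direction @ (mu_pos + mu_neg) / 2.0
    return direction, bias


def predict(acts, direction, bias):
    """Classify by the sign of the projection onto the probe direction."""
    return (np.asarray(acts, dtype=float) @ direction + bias) > 0


# Synthetic stand-in for hidden activations (assumption: real use would
# read these from a model's residual stream at a chosen layer).
rng = np.random.default_rng(0)
d, n = 16, 200
true_dir = np.zeros(d)
true_dir[0] = 1.0  # hypothetical "feature direction" planted in the data
labels = rng.integers(0, 2, size=n).astype(bool)
shift = np.where(labels[:, None], 3.0, -3.0) * true_dir
acts = rng.normal(size=(n, d)) + shift

direction, bias = fit_mass_mean_probe(acts, labels)
accuracy = (predict(acts, direction, bias) == labels).mean()
cosine = direction @ true_dir / np.linalg.norm(direction)
print(f"train accuracy: {accuracy:.3f}, cosine to planted direction: {cosine:.3f}")
```

Unlike a logistic regression probe, this fit needs no iterative optimization; whether the resulting direction is more causally implicated in model outputs is the claim listed above and is not something this sketch tests.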