Activation Probing

Technique of reading out model beliefs from internal activations before the final answer token is generated
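
As an illustration of the definition above, here is a minimal sketch of a linear probe: a logistic-regression classifier trained to read a binary "belief" label out of activation vectors. Everything in it is an assumption made for illustration: the activations are synthetic (Gaussian noise plus a planted direction), and the names `fit_linear_probe` and `probe_predict` are hypothetical. It is not the probing setup of any particular paper, and a real probe would be trained on hidden states captured from a model before the answer token.

```python
import math
import random


def fit_linear_probe(acts, labels, lr=0.1, steps=300):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    n, d = len(acts), len(acts[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(acts, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            logit = max(-30.0, min(30.0, logit))  # keep math.exp in a safe range
            err = 1.0 / (1.0 + math.exp(-logit)) - y  # predicted prob minus label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_predict(x, w, b):
    """Return 1 if the probe reads the label as present, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


# Synthetic stand-in for captured activations (assumption: the label is
# linearly encoded along one hidden direction, plus unit Gaussian noise).
rng = random.Random(0)
dim = 8
direction = [rng.gauss(0, 1) for _ in range(dim)]
norm = math.sqrt(sum(v * v for v in direction))
direction = [v / norm for v in direction]

labels = [rng.randint(0, 1) for _ in range(200)]
acts = [
    [rng.gauss(0, 1) + 3.0 * (2 * y - 1) * dj for dj in direction]
    for y in labels
]

# Train on the first 150 examples, evaluate on the held-out 50.
w, b = fit_linear_probe(acts[:150], labels[:150])
held_out = list(zip(acts[150:], labels[150:]))
accuracy = sum(probe_predict(x, w, b) == y for x, y in held_out) / len(held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy here only shows that the planted label is linearly decodable from the synthetic vectors. With real activations, probe accuracy is usually compared against a control (for example, shuffled labels) before it is read as evidence of an internal belief.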

Neighborhood — ranked by edge-count

paper

framework

Reasoning Theater Framework
uses
The conceptual framework, introduced by the paper, that uses activation probing to distinguish performative chain-of-thought (CoT) from genuine reasoning

concept

Model Internal Belief
about
The latent representational state of a model's answer confidence as decoded from activations, distinct from what appears in generated text

finding

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Diagnostic Probing · method · 0.822
Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction.
Probing Methods · method · 0.813
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach.
Activations · concept · 0.811
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Probes · concept · 0.799
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Activation patching · method · 0.794
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Sparse Probing · method · 0.782
Method from Gurnee et al. 2023 for finding feature directions, including individual neuron analysis.
Activation decomposition · concept · 0.782
The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.
Unsupervised Probing · method · 0.781
Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff.
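
The listing above (entities with cosine ≥ 0.65 and no typed edge, sorted by score) can be sketched as a simple filter over entity embeddings. This is a hedged illustration only: the function `untyped_neighbors`, the three-dimensional vectors, and the edge set are made up for the example and do not reproduce the scores shown on this page or the system that computed them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it, best first."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue  # already related by a typed edge, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical values chosen so the outcome is easy to check).
embeddings = {
    "Activation Probing": [1.0, 0.0, 0.0],
    "Diagnostic Probing": [0.9, 0.1, 0.0],
    "Sparse Probing": [0.7, 0.7, 0.0],
    "Model Internal Belief": [0.8, 0.2, 0.0],
    "Unrelated Entity": [0.0, 1.0, 0.0],
}
typed_edges = {("Activation Probing", "Model Internal Belief")}

candidates = untyped_neighbors("Activation Probing", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

In this toy data, "Model Internal Belief" is skipped because it already has a typed edge, and "Unrelated Entity" falls below the threshold. The remaining entries are the candidates for new edges or duplicate merges.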