concept
active
concept:activation-probing

Activation Probing

Technique of reading out model beliefs from internal activations before the final answer token is generated
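
A minimal sketch of the idea, assuming the simplest kind of probe: a logistic-regression classifier fit on hidden states captured before the answer token. Everything in it is hypothetical and for illustration only. The activations are synthetic, with a label planted along one random direction, and the helper names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) do not come from the source or from any library.

```python
import numpy as np

def make_synthetic_activations(n=400, d=32, seed=0):
    """Synthetic stand-in for hidden states captured before the answer token."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    # Gaussian noise plus a shift along `direction` whose sign encodes the label.
    signs = (2 * labels - 1)[:, None]
    acts = rng.normal(size=(n, d)) + 3.0 * signs * direction
    return acts, labels

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels          # gradient of the log loss w.r.t. logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples where the probe's sign matches the label."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())

acts, labels = make_synthetic_activations()
train, test = slice(0, 300), slice(300, 400)
w, b = train_linear_probe(acts[train], labels[train])
accuracy = probe_accuracy(w, b, acts[test], labels[test])
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With a real model, `acts` would be replaced by activations recorded at a chosen layer and token position, and `labels` by the property being decoded, such as the answer the model eventually gives. High held-out accuracy would suggest the property is linearly decodable at that point, not that the model necessarily uses it.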

Neighborhood — ranked by edge-count

Frameworks (1)

framework
  • Conceptual framework, introduced by the paper, that uses activation probing to distinguish performative CoT from genuine reasoning

Concepts (1)

concept
  • The latent representational state of a model's answer confidence as decoded from activations, distinct from what appears in generated text

Findings (1)

finding

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
  • Probing Methods · method · 0.813
    Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
  • Activations · concept · 0.811
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • Probes · concept · 0.799
    Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
  • Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
  • Sparse Probing · method · 0.782
    Method from Gurnee et al. 2023 for finding feature directions, including analysis of individual neurons
  • The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.
  • Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff