method
active
method:diagnostic-probing

Diagnostic Probing

Earlier interpretability method that trains classifiers on a DNN's hidden representations; shares the complexity-accuracy dilemma with causal abstraction.
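
As a rough illustration of what "applying classifiers to hidden representations" can look like, the sketch below trains a small logistic-regression probe on synthetic vectors that stand in for hidden activations. Everything here is an assumption made for illustration: the synthetic data (a binary property encoded as a shift along coordinate 0), the helper names `make_activations`, `train_probe`, and `probe_accuracy`, and the hyperparameters. It is not the procedure of any particular paper, and a real probe would read activations from an actual model.

```python
import math
import random


def make_activations(n, dim, rng):
    """Synthetic stand-in for hidden representations.

    Assumption: the probed property shifts coordinate 0 by +/-1.5;
    all other coordinates are pure Gaussian noise.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 1.5 if label == 1 else -1.5
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, epochs=50, lr=0.1):
    """Fit a linear (logistic-regression) probe with plain SGD."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)
            err = p - label  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * err * vec[i]
            b -= lr * err
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for vec, label in data:
        logit = sum(wi * xi for wi, xi in zip(w, vec)) + b
        correct += int((1 if logit > 0 else 0) == label)
    return correct / len(data)


rng = random.Random(0)
train = make_activations(400, 8, rng)
test = make_activations(200, 8, rng)  # held-out split for evaluation
w, b = train_probe(train, 8)
acc = probe_accuracy(w, b, test)
print(f"held-out probe accuracy: {acc:.2f}")
```

High held-out accuracy suggests the property is decodable from the representation. The complexity-accuracy dilemma noted above arises because a more expressive classifier can raise accuracy without the representation itself encoding the property any more directly, which is why this sketch keeps the probe linear.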

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Methods (1)

method
  • Probing Methods
    related_to
    Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Probes · concept · 0.832
    Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
  • Activation Probing · concept · 0.822
    Technique of reading out model beliefs from internal activations before the final answer token is generated
  • Diagnosis · method · 0.808
    The method of examining a neighborhood meter by meter to identify healthy and damaged places as the basis for ongoing repair.
  • Linear Probing · method · 0.788
    Used to evaluate representation quality across VTAB tasks
  • Amnesic Probing · method · 0.783
    Behavioral explanation technique using amnesic counterfactuals (Elazar et al., 2020)
  • Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
  • Sparse Probing · method · 0.776
    Method from Gurnee et al. (2023) for finding feature directions, including individual neuron analysis
  • Method of using base models (no post-training) to observe spontaneous self-referential behaviors without confound of memorized introspection language.