method
active
method:diagnostic-probing
Diagnostic Probing
An earlier interpretability method that applies classifiers to the hidden representations of deep neural networks (DNNs); it shares the complexity-accuracy dilemma with causal abstraction.
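As a rough illustration of what applying a classifier to hidden representations can look like, the sketch below trains a linear probe by logistic regression on synthetic vectors. Everything here is an assumption made for illustration: the vectors are random stand-ins for real activations, and the names `make_synthetic_activations`, `train_linear_probe`, and `probe_accuracy` are hypothetical, not taken from any cited work.

```python
import math
import random


def make_synthetic_activations(n, dim, rng):
    """Hypothetical stand-in for hidden representations: the label shifts
    the vector along coordinate 0; all other coordinates are pure noise."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 3.0 if label == 1 else -3.0
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)
            err = p - label  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * err * vec[i]
            b -= lr * err
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for vec, label in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, vec)) + b > 0 else 0
        correct += int(pred == label)
    return correct / len(data)


rng = random.Random(0)
train_set = make_synthetic_activations(200, 8, rng)
held_out = make_synthetic_activations(100, 8, rng)
w, b = train_linear_probe(train_set, 8)
acc = probe_accuracy(w, b, held_out)
print(f"held-out probe accuracy: {acc:.2f}")
```

High held-out accuracy suggests the property is linearly decodable from the representation. The complexity-accuracy dilemma arises because a more expressive probe could score well by learning the task itself, so accuracy alone says less about what the representation encodes.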
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Distributed Alignment Search (DAS) · analogous_to · Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent
Methods (1)
method
- Probing Methods · related_to · Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
- Technique of reading out model beliefs from internal activations before the final answer token is generated
- The method of examining a neighborhood meter by meter to identify healthy and damaged places as the basis for ongoing repair.
- Used to evaluate representation quality across VTAB tasks
- Behavioral explanation technique using amnesic counterfactuals by Elazar et al. 2020
- Probing approach avoiding supervision to sidestep complexity-accuracy tradeoff
- Method from Gurnee et al. 2023 for finding feature directions including individual neuron analysis
- Method of using base models (no post-training) to observe spontaneous self-referential behaviors without the confound of memorized introspection language.
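The "cosine ≥ 0.65 · no typed edge" criterion for this list can be sketched as a simple filter. This is a minimal sketch under stated assumptions, not the site's actual implementation: the two-dimensional embeddings, the entity identifiers, and the function `similar_without_edge` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no
    typed edge to the target, ordered by descending similarity."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: identifiers and vectors are made up for illustration.
embeddings = {
    "diagnostic-probing": [1.0, 0.0],
    "linear-probes": [0.9, 0.1],  # similar, no typed edge: listed
    "das": [0.8, 0.2],            # similar, but has a typed edge: skipped
    "unrelated": [0.0, 1.0],      # orthogonal: below threshold
}
typed_edges = {("diagnostic-probing", "das")}

neighbors = similar_without_edge("diagnostic-probing", embeddings, typed_edges)
print([name for name, _ in neighbors])
```

Entries that pass this filter are only candidates: a high score may indicate a missing edge, a duplicate entity, or a coincidental overlap in wording.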