Noising/Denoising Activation Patching

Methods that intentionally introduce divergent representations (by patching activations between clean and corrupted runs) to test the sufficiency and completeness of circuits.
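As a rough illustration of the two patching directions, the sketch below uses a tiny hand-built network in pure Python. Everything in it is hypothetical and chosen only for illustration: the weights `W` and `v`, the inputs `x_clean` and `x_corrupt`, and the helper `run_patched` do not come from any particular library or paper. Under one common reading, denoising runs the model on the corrupted input and patches in clean activations to ask whether a component is enough to restore clean behavior, while noising runs on the clean input and patches in corrupted activations to ask whether the component is needed.

```python
def hidden(x, W):
    # ReLU hidden layer: one activation per row of W.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]


def readout(h, v):
    # Linear readout of the hidden activations.
    return sum(vi * hi for vi, hi in zip(v, h))


def run_patched(x_base, x_source, W, v, patch_units):
    """Run on x_base, overwriting the hidden units in patch_units
    with their values from a separate run on x_source."""
    h_base = hidden(x_base, W)
    h_src = hidden(x_source, W)
    h = [h_src[i] if i in patch_units else h_base[i] for i in range(len(h_base))]
    return readout(h, v)


# Hypothetical toy model: 2 inputs, 3 hidden units, 1 output.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [2.0, 0.0, 1.0]
x_clean = [1.0, 0.0]
x_corrupt = [0.0, 1.0]

# Unpatched baselines.
clean_out = run_patched(x_clean, x_clean, W, v, set())
corrupt_out = run_patched(x_corrupt, x_corrupt, W, v, set())

# Denoising: corrupted run, clean activation patched into unit 0.
denoised = run_patched(x_corrupt, x_clean, W, v, {0})

# Noising: clean run, corrupted activation patched into unit 0.
noised = run_patched(x_clean, x_corrupt, W, v, {0})

# Noising a unit the readout ignores (unit 1) as a control.
noised_control = run_patched(x_clean, x_corrupt, W, v, {1})

print(clean_out, corrupt_out, denoised, noised, noised_control)
```

In this toy setup, denoising unit 0 brings the output back to the clean value and noising it drops the output to the corrupted value, while noising unit 1 changes nothing. In a real model the same comparison would be made on a task metric such as a logit difference, and the two directions need not agree.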

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Activation patching · method · 0.836
Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
Patch-Closure · concept · 0.757
A set property meaning all coordinate patches of its elements remain within the set; proved equivalent to axis-aligned hyperrectangles.
Residual Stream Activation Patching · method · 0.750
Used to localize causally implicated hidden states by swapping activations between true and false inputs.
Attribution patching · method · 0.743
Gradient-based method to estimate the effect of zeroing a feature on a specific logit difference.
Feature ablation (zeroing feature activations) · method · 0.741
Clamping a feature's value to zero to measure its causal effect on model output.
Activations · concept · 0.727
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Activation Probing · concept · 0.725
Technique of reading out model beliefs from internal activations before the final answer token is generated.
Activation decomposition · concept · 0.722
The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.