method
active
method:noising-denoising-activation-patching
Noising/Denoising Activation Patching
Methods that intentionally introduce divergent representations into a model's activations to test the sufficiency and completeness of circuits.
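As a rough illustration of the description above, the sketch below runs both patching directions on a hand-built toy network. Everything in it is an assumption made for illustration: the weights `W` and `v`, the inputs `x_clean` and `x_corrupt`, the candidate circuit `{0}`, and the helper names `hidden`, `readout`, and `run_with_patch` are all hypothetical. It follows the common convention that denoising means patching clean activations into a corrupted run, and noising means patching corrupted activations into a clean run.

```python
# Toy sketch of noising vs. denoising activation patching.
# All weights, inputs, and names here are hypothetical and chosen for illustration.

def hidden(x, W):
    # ReLU hidden layer: one activation per row of W.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

def readout(h, v):
    # Linear readout of the hidden activations.
    return sum(vi * hi for vi, hi in zip(v, h))

def run_with_patch(x_base, x_source, W, v, units):
    # Run on x_base, but overwrite the listed hidden units with the
    # values those units take on x_source.
    h_base = hidden(x_base, W)
    h_src = hidden(x_source, W)
    patched = [h_src[i] if i in units else h_base[i] for i in range(len(h_base))]
    return readout(patched, v)

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three hidden units, two inputs
v = [2.0, 0.0, 1.0]                       # readout weights
x_clean = [1.0, 0.0]
x_corrupt = [0.0, 1.0]
circuit = {0}                             # candidate circuit: hidden unit 0

out_clean = readout(hidden(x_clean, W), v)
out_corrupt = readout(hidden(x_corrupt, W), v)

# Denoising: corrupted run, clean activations patched into the circuit.
# If this restores the clean output, the circuit is sufficient on this pair.
out_denoised = run_with_patch(x_corrupt, x_clean, W, v, circuit)

# Noising: clean run, corrupted activations patched into the circuit.
# If this destroys the clean output, nothing outside the circuit carries the effect.
out_noised = run_with_patch(x_clean, x_corrupt, W, v, circuit)

gap = out_clean - out_corrupt
recovered = (out_denoised - out_corrupt) / gap  # fraction restored by denoising
destroyed = (out_clean - out_noised) / gap      # fraction removed by noising
print(recovered, destroyed)
```

In this toy case both fractions equal 1, so unit 0 alone accounts for the whole clean/corrupted gap. In a real model the two directions can disagree, which is why they are usually reported separately.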
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
- A set property meaning all coordinate patches of its elements remain within the set; proved equivalent to axis-aligned hyperrectangles.
- Used to localize causally implicated hidden states by swapping activations between true and false inputs.
- Gradient-based method to estimate the effect of zeroing a feature on a specific logit difference.
- Clamping a feature's value to zero to measure its causal effect on model output.
- Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
- Technique of reading out model beliefs from internal activations before the final answer token is generated.
- The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.