method
active
method:noising-denoising-activation-patching

Noising/Denoising Activation Patching

Methods that intentionally introduce divergent representations to test sufficiency and completeness of circuits
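
As a rough illustration of the two patching directions, the sketch below uses a toy two-layer NumPy network rather than a real language model. Everything in it is an assumption made for illustration: the `forward` helper, the random weights, the two inputs, and the `circuit` index set are all hypothetical. On the usual reading, denoising runs the model on the corrupted input and restores clean activations on the candidate circuit (does that recover clean behavior?), while noising runs on the clean input and overwrites the circuit with corrupted activations (does that break it?).

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    """Run a tiny 2-layer ReLU network, optionally overwriting hidden units.

    patch: optional (indices, values) pair; hidden[indices] is replaced by values.
    Returns (output, hidden activations after any patch).
    """
    hidden = np.maximum(W1 @ x, 0.0)
    if patch is not None:
        idx, values = patch
        hidden = hidden.copy()
        hidden[idx] = values
    return W2 @ hidden, hidden

# Toy setup (all values hypothetical).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))
x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)

# Baseline runs, caching hidden activations from each input.
out_clean, h_clean = forward(x_clean, W1, W2)
out_corrupt, h_corrupt = forward(x_corrupt, W1, W2)

# Hypothetical candidate circuit: a subset of hidden units.
circuit = np.array([0, 3, 5])

# Denoising: corrupted run, clean activations restored on the circuit.
out_denoised, _ = forward(x_corrupt, W1, W2, patch=(circuit, h_clean[circuit]))

# Noising: clean run, corrupted activations written onto the circuit.
out_noised, _ = forward(x_clean, W1, W2, patch=(circuit, h_corrupt[circuit]))

# Distance from the clean output under each intervention.
denoise_gap = np.linalg.norm(out_denoised - out_clean)
noise_gap = np.linalg.norm(out_noised - out_clean)
```

A small `denoise_gap` would suggest the patched units carry enough information to restore clean behavior; a large `noise_gap` would suggest the clean behavior depends on them. In a real model the same pattern is usually applied through forward hooks on a chosen layer, with a task-specific metric in place of the raw output distance.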

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Standard method in mechanistic interpretability that intervenes on activations; VPD flips this paradigm by patching parameters.
  • Patch-Closure · concept · 0.757
    A set property meaning all coordinate patches of its elements remain within the set; proved equivalent to axis-aligned hyperrectangles
  • Used to localize causally implicated hidden states by swapping activations between true and false inputs
  • Gradient-based method to estimate the effect of zeroing a feature on a specific logit difference.
  • Clamping a feature's value to zero to measure its causal effect on model output.
  • Activations · concept · 0.727
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • Activation Probing · concept · 0.725
    Technique of reading out model beliefs from internal activations before the final answer token is generated
  • The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.