method
active
method:activation-patching

Activation patching

Standard method in mechanistic interpretability that intervenes on a model's activations; VPD flips this paradigm by patching parameters instead.
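
To make "intervening on activations" concrete, here is a minimal sketch of the idea on a toy network. Everything in it is an illustrative assumption rather than anything taken from the entries on this page: the weights `W1` and `W2`, the two inputs, and the helper names `hidden` and `forward` are all hypothetical. The sketch caches hidden activations from a "clean" input, then re-runs a "corrupted" input while overwriting one hidden unit at a time with its clean value and records how far the output moves.

```python
# Toy two-layer ReLU network with hand-picked weights (hypothetical values,
# chosen only so the arithmetic is easy to follow).
W1 = [[1.0, -1.0], [0.5, 2.0]]  # one row of input weights per hidden unit
W2 = [1.0, -2.0]                # readout weight per hidden unit


def hidden(x):
    """Hidden-layer activations for input vector x."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    return [max(0.0, p) for p in pre]


def forward(x, patch=None):
    """Run the network; `patch` maps hidden-unit index -> value to overwrite."""
    h = hidden(x)
    if patch:
        for i, value in patch.items():
            h[i] = value  # the intervention: replace the activation in place
    return sum(w * hi for w, hi in zip(W2, h))


clean_x = [1.0, 0.0]    # input whose behavior we want to explain
corrupt_x = [0.0, 1.0]  # contrasting input

clean_h = hidden(clean_x)          # cache activations from the clean run
clean_out = forward(clean_x)
corrupt_out = forward(corrupt_x)

# Patch each hidden unit of the corrupted run with its clean value and
# measure the change in output relative to the unpatched corrupted run.
effects = {}
for i in range(len(clean_h)):
    patched_out = forward(corrupt_x, patch={i: clean_h[i]})
    effects[i] = patched_out - corrupt_out

print(clean_out, corrupt_out, effects)
```

In this toy setup the unit with the larger effect accounts for more of the gap between the clean and corrupted outputs. In a real model the same pattern is usually implemented with framework hooks on a chosen layer rather than by editing a Python list, and effects generally do not add up as neatly as they do with a single linear readout.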

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Concepts (1)

concept
  • Mechanistic interpretability technique for locating factual associations, mentioned as future work direction.

Methods (1)

method
  • Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.

Claims (1)

claim

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Methods that intentionally introduce divergent representations to test sufficiency and completeness of circuits
  • Activations · concept · 0.821
    Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
  • Path Patching · method · 0.805
    Method by Goldowsky-Dill et al. 2023 for localizing model behavior via targeted activation interventions
  • Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
  • Patch-Closure · concept · 0.797
    A set property meaning all coordinate patches of its elements remain within the set; proved equivalent to axis-aligned hyperrectangles
  • Activation Probing · concept · 0.794
    Technique of reading out model beliefs from internal activations before the final answer token is generated
  • Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
  • Used to localize causally implicated hidden states by swapping activations between true and false inputs