method
active
method:directional-ablation

Directional Ablation

Intervention method that removes a direction from residual stream activations to disrupt the corresponding behavior
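The entry states only that a direction is removed from the activations. A minimal sketch, assuming the common reading in which "removing" means projecting each activation vector onto the orthogonal complement of the direction, i.e. x' = x - (x · r̂) r̂ with r̂ the unit-normalized direction. The function name `ablate_direction` and the toy vectors are hypothetical and not taken from the source; a real model would apply the same arithmetic to residual stream tensors at the chosen layers and positions.

```python
import math


def ablate_direction(activations, direction):
    """Remove the component along `direction` from each activation vector.

    Assumption (not from the source): ablation is orthogonal projection,
    x' = x - (x . r_hat) * r_hat, where r_hat is `direction` scaled to unit norm.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]

    ablated = []
    for x in activations:
        # Scalar projection of x onto the unit direction.
        coeff = sum(xi * ui for xi, ui in zip(x, unit))
        # Subtract that component; everything orthogonal to it is untouched.
        ablated.append([xi - coeff * ui for xi, ui in zip(x, unit)])
    return ablated


# Toy example: two 3-dimensional "activations" and a direction along the last axis.
acts = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
r = [0.0, 0.0, 2.0]  # need not be unit length; it is normalized inside
out = ablate_direction(acts, r)
```

After the call, every vector in `out` has zero component along `r`, while its components orthogonal to `r` are unchanged. This differs from zeroing the whole activation, which discards all information rather than a single direction.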

Neighborhood — ranked by edge-count

Frameworks (1)

framework
  • The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors

Concepts (1)

concept
  • Property requiring that ablating a truth direction shifts model output from truthful to false without other side effects

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Zero Ablation · method · 0.805
    Intervention type that sets activations to zero, used for interpretability analysis
  • Classical techniques to interrogate regulative capacity of embryos and neural crest by tissue removal or transplantation.
  • Technique used in VPD to enforce mechanistic faithfulness of parameter decompositions.
  • An algorithm that determines the marginal effect of n-th order path terms by running the model multiple times with frozen attention patterns and progressively replacing activations
  • spatialization · concept · 0.719
    The translation of semantic values into spatial coordinates and relations.
  • Special case of immediate feedback loop where user interacts with artifacts in a lifelike manner, typically through cursor or finger-based dragging.
  • Central property of agency: energy expended to reach specific states despite disturbances.
  • Systematic sweep of 10 boost levels from threshold-3σ to threshold+3σ to characterize ESR vs. steering strength