method
active
method:directional-ablation
Directional Ablation
Intervention method that removes a direction from residual-stream activations to disrupt the corresponding behavior
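A minimal sketch of the idea, assuming the common formulation in which the component of an activation x along a unit direction r̂ is projected out, x' = x - (x · r̂) r̂. The function name `ablate_direction` and the toy vectors are illustrative assumptions, not taken from this entry; in practice the same operation would be applied to a model's residual stream at each layer and token position, for example through forward hooks.

```python
import math
from typing import List, Sequence


def ablate_direction(activation: Sequence[float],
                     direction: Sequence[float]) -> List[float]:
    """Return `activation` with its component along `direction` removed.

    Hypothetical helper for illustration: projects out a single direction,
    leaving every orthogonal component unchanged.
    """
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]                     # r_hat
    coeff = sum(a * u for a, u in zip(activation, unit))     # x . r_hat
    return [a - coeff * u for a, u in zip(activation, unit)]  # x - (x . r_hat) r_hat


# Toy 3-dimensional "residual stream" vector and a direction to remove.
x = [2.0, 1.0, -1.0]
r = [0.0, 3.0, 4.0]
x_ablated = ablate_direction(x, r)

# After ablation the vector has (numerically) zero projection onto r.
residual_projection = sum(a * d for a, d in zip(x_ablated, r))
```

Because only the component along `r` is subtracted, coordinates orthogonal to it (here the first one) are left untouched, which is the property that distinguishes this intervention from setting the whole activation to zero.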
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (1)
framework
- Concept Cones · uses · The central framework this paper extends from refusal to propositional truth; identifies multi-dimensional subspaces that causally mediate target behaviors
Concepts (1)
concept
- Surgical Ablation Property · associated_with · Property requiring that ablating a truth direction shifts model output from truthful to false without other side effects
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Intervention type that sets activations to zero, used for interpretability analysis
- Classical techniques to interrogate regulative capacity of embryos and neural crest by tissue removal or transplantation.
- Technique used in VPD to enforce mechanistic faithfulness of parameter decompositions.
- An algorithm that determines the marginal effect of n-th order path terms by running the model multiple times with frozen attention patterns and progressively replacing activations
- The translation of semantic values into spatial coordinates and relations.
- Special case of immediate feedback loop where user interacts with artifacts in a lifelike manner, typically through cursor or finger-based dragging.
- Central property of agency: energy expended to reach specific states despite disturbances.
- Systematic sweep of 10 boost levels from threshold-3σ to threshold+3σ to characterize ESR vs. steering strength