method
active
method:reflection-inhibition-via-activation-subtraction
Reflection Inhibition via Activation Subtraction
Applying a reverse steering vector to suppress reflective behavior at inference time.
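As a rough illustration of what "activation subtraction" means here, the sketch below subtracts a scaled steering vector from a single hidden-state vector. It is a minimal pure-Python toy, not the source's implementation: the names `subtract_steering`, `alpha`, and `dot`, and the three-dimensional example values, are hypothetical and chosen only for readability.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def subtract_steering(hidden: Vector, steering: Vector, alpha: float = 1.0) -> Vector:
    """Return hidden - alpha * steering (the reverse-direction intervention).

    `alpha` is an assumed strength parameter; the source does not specify one.
    """
    if len(hidden) != len(steering):
        raise ValueError("hidden and steering must have the same dimension")
    return [h - alpha * s for h, s in zip(hidden, steering)]


# Toy values (assumed): the second coordinate plays the role of the
# "reflection" direction.
hidden = [1.0, 2.0, 3.0]
steering = [0.0, 1.0, 0.0]

steered = subtract_steering(hidden, steering, alpha=2.0)

# The component of the hidden state along the steering direction shrinks,
# while the other coordinates are left unchanged.
before = dot(hidden, steering)
after = dot(steered, steering)
```

In a real model the same subtraction would be applied to the activations of a chosen layer during the forward pass; which layer and which strength to use are not stated in this entry.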
Neighborhood — ranked by edge-count
Methods (1)
method
- Activation Steering · implements · Causal intervention technique: edit the NLA explanation, reconstruct via AR, and use the difference as a steering vector to manipulate model behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Adding a steering vector in the forward direction to push model activations toward stronger reflective behavior.
- Applied security implication derived from the asymmetry finding.
- Central interpretive claim of the paper, supported by steering vector experiments.
- Technique for obtaining concept vectors by presenting the model with two scenarios that differ in one respect and subtracting their activations to isolate the conceptual difference.
- Empirical interpretation of which reference baseline yields more useful steering vectors.
- Key asymmetry finding interpreted mechanistically by the authors.
- Applied dual-use conclusion drawn from the paper's findings.
- Reflection level where explicit cue words (e.g., 'wait') prompt the model to inspect and revise reasoning.
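
The contrastive technique in the list above (two scenarios differing in one respect, with their activations subtracted) could be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the source's procedure: the function names, the averaging over several scenario pairs, and the toy activation values are all hypothetical.

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of a non-empty collection of equal-length vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    if any(len(r) != dim for r in rows):
        raise ValueError("all activation vectors must share one dimension")
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]


def concept_vector(
    with_concept: Sequence[Sequence[float]],
    without_concept: Sequence[Sequence[float]],
) -> List[float]:
    """Mean activation with the concept minus mean activation without it.

    Averaging over several scenarios is an assumption made here to reduce
    noise; a single pair reduces to a plain difference of two vectors.
    """
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


# Toy activations (assumed): the scenarios differ only in the second coordinate.
with_reflection = [[1.0, 4.0], [3.0, 6.0]]
without_reflection = [[1.0, 1.0], [3.0, 1.0]]

v = concept_vector(with_reflection, without_reflection)
```

In this toy case the shared first coordinate cancels and only the second coordinate survives, which is the sense in which the subtraction isolates the conceptual difference. The resulting vector could then be added or subtracted as a steering vector.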