Reflection Inhibition via Activation Subtraction

Applying a steering vector in the reverse direction (subtracting it from the model's activations) to suppress reflective behavior at inference time.
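The subtraction itself can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `steer`, the scale `alpha`, and the toy arrays are hypothetical, and the layer and strength used in practice are not specified on this page. It only shows that inhibition amounts to shifting each token's hidden state against the direction of a reflection vector.

```python
import numpy as np

def steer(hidden, vector, alpha):
    """Shift every token's hidden state by alpha * vector.

    alpha < 0 subtracts the vector (inhibition);
    alpha > 0 adds it (enhancement).
    """
    hidden = np.asarray(hidden, dtype=float)
    vector = np.asarray(vector, dtype=float)
    if hidden.shape[-1] != vector.shape[-1]:
        raise ValueError("vector width must match the hidden size")
    # Broadcasting applies the same shift to every token position.
    return hidden + alpha * vector

# Toy example: 2 token positions, hidden size 3 (values are made up).
hidden = np.array([[1.0, 2.0, 3.0],
                   [0.5, 0.0, -1.0]])
reflection_vector = np.array([1.0, 0.0, -1.0])

# Reverse direction: subtract twice the reflection vector.
inhibited = steer(hidden, reflection_vector, alpha=-2.0)
```

In a real model this shift would typically be applied inside a forward hook on a chosen layer; that wiring is omitted here because it depends on the framework and model.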

Neighborhood — ranked by edge-count

Methods (1)

Activation Steering · method · implements
Causal intervention technique: edit the NLA explanation, reconstruct via AR, and use the difference as a steering vector to manipulate model behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reflection Enhancement via Activation Addition · method · 0.819
Adding the steering vector in the forward direction to push model activations toward stronger reflective behavior.
The ease of suppressing reflection via activation steering raises security risks, as malicious actors could exploit reflection inhibition to bypass model safeguards. · claim · 0.775
Applied security implication derived from the asymmetry finding.
Reflection is not merely a behavioral artifact of prompting but a phenomenon encoded in the model's activation space. · claim · 0.768
Central interpretive claim of the paper, supported by steering vector experiments.
Contrastive pair activation subtraction · method · 0.766
Technique for obtaining concept vectors by presenting the model with two scenarios that differ in one respect and subtracting their activations to isolate the conceptual difference.
Contrasting No Reflection with Triggered Reflection (µ(0→2)) provides a stronger reflection signal than contrasting Intrinsic with Triggered Reflection (µ(1→2)). · claim · 0.747
Empirical interpretation of which reference baseline yields more useful steering vectors.
Suppressing reflection is considerably easier than inducing it, because inhibition requires the model to terminate reasoning while enhancement demands additional cognitive effort to re-examine reasoning trajectories. · claim · 0.746
Key asymmetry finding interpreted mechanistically by the authors.
Activation steering of reflection has dual-use implications: it can enhance reflection as a defense mechanism, but malicious actors may also use it to inhibit reflection to facilitate jailbreaks. · claim · 0.744
Applied dual-use conclusion drawn from the paper's findings.
Triggered Reflection · concept · 0.732
Reflection level where explicit cue words (e.g., 'wait') prompt the model to inspect and revise reasoning.
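
The contrastive subtraction described in the neighborhood above can be sketched as a difference of mean activations between two conditions. This is an assumption for illustration: the page does not give the exact definition of µ(0→2), so the difference-of-means form, the function name `contrastive_vector`, and the toy activations are all hypothetical.

```python
import numpy as np

def contrastive_vector(acts_base, acts_target):
    """Mean activation of the target condition minus that of the base condition.

    Both inputs have shape (num_examples, hidden_size).
    """
    base = np.asarray(acts_base, dtype=float)
    target = np.asarray(acts_target, dtype=float)
    return target.mean(axis=0) - base.mean(axis=0)

# Made-up activations: 2 examples per condition, hidden size 2.
no_reflection = np.array([[0.0, 1.0],
                          [2.0, 1.0]])
triggered = np.array([[3.0, 1.0],
                      [5.0, 3.0]])

# Assumed analogue of µ(0→2): No Reflection -> Triggered Reflection.
mu_0_to_2 = contrastive_vector(no_reflection, triggered)
```

Adding a scaled copy of such a vector to the activations corresponds to enhancement, and subtracting it corresponds to inhibition.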