Reflection Enhancement via Activation Addition

Adds a steering vector in the forward direction to push model activations toward stronger reflective behavior.
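
As a rough illustration of the idea, the sketch below adds a scaled steering vector to every token position of a small activation matrix. The function name `add_steering`, the `alpha` scale factor, and the toy values are assumptions made for this example and do not come from the source. A negative `alpha` corresponds to applying the vector in the reverse direction.

```python
from typing import List

Vector = List[float]


def add_steering(activations: List[Vector], steering: Vector, alpha: float = 1.0) -> List[Vector]:
    """Return a copy of `activations` with `alpha * steering` added at every position.

    `activations` is a (positions x hidden_dim) matrix of activations and
    `steering` is a single hidden_dim-sized direction vector.
    alpha > 0 pushes along the direction, alpha < 0 pushes against it.
    """
    for row in activations:
        if len(row) != len(steering):
            raise ValueError("steering vector must match the hidden dimension")
    # Build new rows so the caller's activations are left untouched.
    return [[a + alpha * s for a, s in zip(row, steering)] for row in activations]


# Toy example (hypothetical values): 2 token positions, hidden size 3.
acts = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
reflect_dir = [1.0, 0.0, -1.0]

enhanced = add_steering(acts, reflect_dir, alpha=2.0)    # forward direction
inhibited = add_steering(acts, reflect_dir, alpha=-2.0)  # reverse direction
```

In a real model the same addition would typically be applied inside a forward hook on a chosen layer; the layer choice and the scale are tuning decisions that this sketch does not address.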

Neighborhood — ranked by edge-count

method

Activation Addition
related_to
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
Activation Steering
implements
Causal intervention technique: edit the NLA explanation, reconstruct activations via the AR, and use the difference as a steering vector to manipulate model behavior.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Reflection Inhibition via Activation Subtraction · method · 0.819
Applies the reverse steering vector to suppress reflective behavior at inference time.
Activation Addition (ActAdd) · framework · 0.796
Steering method deriving vectors from contrastive prompt pairs and adding to first-token activations.
Contrastive Activation Addition (CAA) · method · 0.790
An existing activation steering method used as comparative baseline.
Our method achieves superior performance compared to Contrastive Activation Addition. · finding · 0.779
Performance gains over CAA in steering tasks.
Response Text Augmentation · method · 0.767
Strategy using GPT-4o, Claude 3.5 Sonnet, and Gemini to generate additional responses preserving original meaning, targeting ≥1000 words concatenated per score category.
Activation Reconstructor (AR) · method · 0.766
Component of NLA that maps natural language explanations back to activations; truncated to first l layers of target model.
Activations · concept · 0.766
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Triggered Reflection · concept · 0.766
Reflection level where explicit cue words (e.g., 'wait') prompt the model to inspect and revise reasoning.
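
The candidate list above can be thought of as a filter over entity embeddings: keep entities whose cosine similarity to this one is at least 0.65 and that have no typed edge to it, then sort by similarity. The sketch below is one way to express that under stated assumptions; the function names, the embedding dictionary, and the toy two-dimensional vectors are hypothetical and are not taken from the system that produced this page.

```python
import math
from typing import Dict, List, Set, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(
    target: str,
    embeddings: Dict[str, List[float]],
    typed: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities with cosine >= threshold to `target` and no typed edge, best first."""
    scored = [
        (name, cosine(embeddings[target], vec))
        for name, vec in embeddings.items()
        if name != target and name not in typed  # skip self and already-linked entities
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): "linked" already has a typed edge and is excluded.
toy_embeddings = {
    "target": [1.0, 0.0],
    "linked": [1.0, 0.0],
    "close": [0.8, 0.6],
    "far": [0.0, 1.0],
    "borderline": [0.6, 0.8],
}
candidates = untyped_neighbors("target", toy_embeddings, typed={"linked"})
```

With these toy vectors only `close` passes the filter: `borderline` falls just under the 0.65 threshold and `linked` is skipped because it already has a typed relation.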