Contrastive Steering Vector Construction

Method for computing steering vectors as mean activation differences between reflection levels at a given layer.
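
A minimal sketch of the mean-difference computation described above. It assumes the activations at the chosen layer have already been extracted, one vector per prompt, as plain lists of floats. The function name `steering_vector` and the toy values are hypothetical illustrations, not taken from the source paper.

```python
from statistics import fmean


def steering_vector(high_acts, low_acts):
    """Mean activation difference between two reflection levels at one layer.

    high_acts, low_acts: lists of activation vectors (lists of floats),
    one per prompt, all from the same layer. Extraction is assumed done.
    """
    if not high_acts or not low_acts:
        raise ValueError("need at least one activation vector per level")
    dim = len(high_acts[0])
    if any(len(v) != dim for v in high_acts + low_acts):
        raise ValueError("all activation vectors must share one dimension")
    # Average each level separately, then subtract the means.
    mean_high = [fmean(v[i] for v in high_acts) for i in range(dim)]
    mean_low = [fmean(v[i] for v in low_acts) for i in range(dim)]
    return [h - l for h, l in zip(mean_high, mean_low)]


# Toy activations (hypothetical): two prompts per reflection level, dimension 3.
high = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
low = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]
vec = steering_vector(high, low)
```

Averaging each level before subtracting keeps the sketch valid when the two levels have different numbers of prompts; with strictly paired prompts it is equivalent to averaging the per-pair differences.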

Neighborhood — ranked by edge-count

Papers (1)

paper

Unveiling the Latent Directions of Reflection in Large Language Models
introduces

Concepts (1)

concept

Contrastive Pairs
uses
Pairs of prompts at different reflection levels used to compute steering vectors.

Methods (1)

method

Activation Steering
implements
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.824
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
steering vectors · concept · 0.808
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
Contrastive Activation Steering · method · 0.799
Core technique: takes mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
Contrastive concept vector extraction · method · 0.778
Method for obtaining concept vectors by subtracting activations from two contrasting prompts.
Composite Steering Vector (v_role-truth) · concept · 0.776
Steering vector extracted in Experiment 2 that captures the latent representation of desired role behavior and honesty semantics.
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity · claim · 0.758
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.755
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Bidirectional Steering · concept · 0.749
Ability to steer model behavior in two opposite semantic directions on a trait.
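
The application step mentioned in the entries above (adding the vector to the residual stream at inference time, in either direction) can be sketched as follows. This is only an illustration on a single hidden-state vector represented as a list of floats; the function name `apply_steering`, the scaling coefficient `alpha`, and the toy values are assumptions, and a real setup would apply the same addition inside the model's forward pass at the chosen layer.

```python
def apply_steering(hidden, vec, alpha=1.0):
    """Return hidden + alpha * vec for one hidden-state vector.

    alpha is a hypothetical scaling coefficient: a positive value pushes
    toward the higher reflection level, a negative value pushes away from it.
    """
    if len(hidden) != len(vec):
        raise ValueError("hidden state and steering vector must match in size")
    return [h + alpha * v for h, v in zip(hidden, vec)]


# Toy values (hypothetical): one hidden state and a steering vector.
hidden = [0.5, -1.0, 2.0]
vec = [1.0, 1.0, 1.0]
more_reflection = apply_steering(hidden, vec, alpha=2.0)
less_reflection = apply_steering(hidden, vec, alpha=-2.0)
```

Using one signed coefficient for both directions mirrors the bidirectional steering idea: the same vector is reused, and only the sign of `alpha` changes.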