method
active
method:contrastive-activation-steering

Contrastive Activation Steering

Core technique: takes the mean difference of model activations on contrastive prompts and adds the resulting vector to the residual stream at inference time.
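
The two steps in this description can be sketched on toy arrays. This is a minimal illustration, not the implementation behind this entry: the function names, the `alpha` scaling factor, and the hand-written activation values are all assumptions made for the example. In a real model the activations would be captured at a chosen layer, for instance with a forward hook.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean activation on the positive prompts minus mean on the negative prompts.

    Both inputs have shape (n_prompts, d_model). Hypothetical helper.
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    return pos.mean(axis=0) - neg.mean(axis=0)

def apply_steering(residual, vector, alpha=1.0):
    """Add the scaled vector to every token position of a residual stream.

    residual has shape (seq_len, d_model); alpha is an assumed strength knob.
    """
    return np.asarray(residual, dtype=float) + alpha * vector

# Toy activations standing in for one layer's outputs (d_model = 3).
pos_acts = np.array([[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]])
neg_acts = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]])

vec = steering_vector(pos_acts, neg_acts)

# Steer a 4-token residual stream at half strength.
residual = np.zeros((4, 3))
steered = apply_steering(residual, vec, alpha=0.5)
```

The vector is broadcast across token positions here; restricting it to particular positions or layers is a design choice this entry does not specify.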

Neighborhood — ranked by edge-count

Concepts (3)

concept
  • Core concept: the ability of LLMs to detect when they are being tested and adjust behavior accordingly.
  • steering vectors
    associated_with
    A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
  • The intermediate representations in transformer layers whose activations are patched and probed for truth information

Methods (1)

method
  • Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.