Refusal Vector

Single linear direction mediating refusal behavior in LLMs, shown by Arditi et al.; related to but distinct from the Assistant Axis

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Deception Vector · concept · 0.778
Extracted steering vector capturing semantic dimension of strategic deception in moral dilemmas in Experiment 1
Refusal direction · concept · 0.770
Arditi et al. 2024 finding that refusal behavior is mediated by one direction in LLM activations; exemplar of single-direction causal results
refusal rate · concept · 0.752
The percentage of harmful requests that a model refuses to answer, a common safety metric.
Sense Vectors · concept · 0.737
Vectors acquired during pretraining in Backpack LMs that have a multiplication effect on model generation
Persona Vectors (Chen et al.) · framework · 0.730
Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles
steering vectors · concept · 0.725
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
Refusal Direction in LLMs · concept · 0.716
Prior finding that LLM refusal is mediated by a single latent direction, analogous to this paper's reflection direction.
Random vector baseline · method · 0.716
Baseline method sampling a random vector as feature direction for comparison with learned methods
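
The selection rule behind the list above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names `cosine` and `untyped_neighbors`, the two-dimensional toy embeddings, and the assumption that the Assistant Axis already has a typed edge to this entity are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no
    typed edge to `target`, sorted by descending similarity."""
    linked = typed_edges.get(target, set())
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and already-linked entities
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): real entity embeddings would be high-dimensional.
embeddings = {
    "Refusal Vector": [1.0, 0.0],
    "Refusal direction": [0.8, 0.6],  # similar, no typed edge
    "Assistant Axis": [0.9, 0.1],     # similar, but assumed already linked
    "Unrelated": [0.0, 1.0],          # orthogonal, below threshold
}
typed_edges = {"Refusal Vector": {"Assistant Axis"}}

neighbors = untyped_neighbors("Refusal Vector", embeddings, typed_edges)
for name, score in neighbors:
    print(f"{name} · {score:.3f}")
```

Under this reading, near-duplicates such as "Refusal direction" and "Refusal Direction in LLMs" surface precisely because they are close in embedding space yet have no typed relation recorded.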