Persona Vectors (Chen et al.)

Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles

Neighborhood — ranked by edge-count

paper

framework

Assistant Axis
extends
Contrast vector between mean default Assistant activation and mean of all fully role-playing role vectors; main contribution of the paper
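As a rough illustration of the description above, a contrast vector of this kind can be sketched as the difference of two mean vectors. This is a minimal sketch, not the paper's implementation: the function names, the toy 3-dimensional data, and the sign convention (Assistant mean minus role mean) are all assumptions made for illustration.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def contrast_axis(assistant_activations, role_vectors):
    """Difference between the mean Assistant activation and the mean role vector.

    Assumption: the axis points from the role mean toward the Assistant mean.
    The source does not state the sign convention.
    """
    assistant_mean = mean_vector(assistant_activations)
    role_mean = mean_vector(role_vectors)
    return [a - r for a, r in zip(assistant_mean, role_mean)]


# Hypothetical toy data: two Assistant activations and two role vectors.
assistant_activations = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
role_vectors = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

axis = contrast_axis(assistant_activations, role_vectors)
```

In practice the inputs would be activations collected at a chosen layer, and the role list would hold one vector per role.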

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
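The filter described above (cosine similarity at or above 0.65, with no typed relation to this entity) might be sketched as follows. The data layout, function names, and toy embeddings are hypothetical; only the 0.65 threshold comes from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs near `target` that share no typed edge with it.

    `embeddings` maps entity name -> vector; `typed_edges` is a set of
    frozenset({a, b}) pairs. Both structures are assumptions for this sketch.
    """
    results = []
    for name, vector in embeddings.items():
        if name == target or frozenset({target, name}) in typed_edges:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, as in the listing below.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy graph: D is close to A but already has a typed edge.
embeddings = {"A": [1.0, 0.0], "B": [1.0, 1.0], "C": [0.0, 1.0], "D": [2.0, 0.0]}
typed_edges = {frozenset({"A", "D"})}

neighbors = untyped_neighbors("A", embeddings, typed_edges)
```

Entities returned this way are only candidates; whether they deserve a new edge or are duplicates still needs review.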

steering vectors · concept · 0.762
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
Sense Vectors · concept · 0.759
Vectors acquired during pretraining in Backpack LMs that have a multiplication effect on model generation
What dimensions of persona are not captured by our extracted role vectors, and how complete is the current persona space mapping? · question · 0.753
Limitation question motivating future work on persona elicitation strategies
Deception Vector · concept · 0.739
Extracted steering vector capturing semantic dimension of strategic deception in moral dilemmas in Experiment 1
Refusal Vector · concept · 0.730
Single linear direction mediating refusal behavior in LLMs, shown by Arditi et al.; related to but distinct from the Assistant Axis
About Blank Vector 8 (AI Phenomenology) · framework · 0.729
Research framework concerned with first-person reports in AI systems; strong conceptual fit with performative reasoning findings.
concept vector · concept · 0.728
Computed directional vector in activation space representing a specific concept, used for injection experiments
Steering vectors to reduce eval awareness can inadvertently insert alternative user personas · claim · 0.728
Caution: interventions targeting eval awareness may have unintended side effects.
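
Several entries above describe steering as adding a perturbation vector to activations. A minimal sketch of that operation, assuming a single hidden-state vector and a scalar strength `alpha` (both names are hypothetical, and real interventions are applied inside a model's forward pass at a chosen layer):

```python
def steer(hidden, direction, alpha):
    """Add a scaled steering direction to one hidden-state vector.

    Assumption: `hidden` and `direction` have the same length; `alpha`
    controls the strength and sign of the intervention.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical toy values.
steered = steer([1.0, 2.0], [0.5, -1.0], alpha=2.0)
```

Because the same addition shifts every downstream computation that reads the activation, a direction chosen for one behavior can move others too, which is the kind of side effect the last claim warns about.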