Role Vector Extraction

Pipeline that produces role vectors by averaging post-MLP residual stream activations over model responses generated under persona-specific system prompts.
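As a rough illustration of the averaging step, the sketch below computes one role vector from activations that have already been captured. It assumes the activations for each response arrive as an array of shape (n_tokens, d_model) taken at a single layer, and that the mean is taken over tokens first and then over responses; the entry does not specify the averaging order, the layer, or the token selection, so those choices and the name `role_vector` are hypothetical. Synthetic data stands in for real model activations.

```python
import numpy as np

def role_vector(response_activations):
    """Average activations within each response, then across responses.

    response_activations: list of arrays, each (n_tokens, d_model), holding
    post-MLP residual stream activations for the response tokens of one
    sampled response under a single persona system prompt (assumed layout).
    """
    per_response = [
        np.asarray(a, dtype=float).mean(axis=0)  # mean over response tokens
        for a in response_activations
    ]
    return np.mean(per_response, axis=0)  # mean over responses

# Synthetic stand-in for captured activations: 3 responses, d_model = 8.
rng = np.random.default_rng(0)
acts = [rng.normal(size=(n_tokens, 8)) for n_tokens in (5, 12, 7)]
vec = role_vector(acts)
```

Averaging per response before averaging across responses keeps long responses from dominating the result; pooling all tokens into a single mean would be an equally plausible reading of the description.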

Neighborhood — ranked by edge-count

concept

Residual Stream
uses
Proposed pathway flowing through layers at each position; calculates K/V values that feed horizontal information flow.

method

PCA on Persona Space
uses
Standardized PCA run on the role vectors to find the main axes of persona variation.
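A minimal sketch of what standardized PCA over a set of role vectors could look like, assuming "standardized" means each feature is z-scored across roles before the decomposition. The function name `standardized_pca`, the number of components, and the synthetic input are illustrative assumptions, not details taken from the entry.

```python
import numpy as np

def standardized_pca(role_vectors, n_components=2):
    """Z-score each feature across roles, then run PCA via SVD."""
    X = np.asarray(role_vectors, dtype=float)   # (n_roles, d_model)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0                     # avoid dividing by zero
    Z = (X - mu) / sigma

    # Rows of Vt are the principal axes, ordered by singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)       # variance ratio per axis
    scores = Z @ components.T                   # role coordinates on the axes
    return components, scores, explained[:n_components]

# Synthetic stand-in: 20 role vectors of dimension 6.
rng = np.random.default_rng(0)
roles = rng.normal(size=(20, 6))
components, scores, explained = standardized_pca(roles)
```

Each row of `scores` places one role along the leading axes, which is the form in which axes of persona variation would typically be inspected.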

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Contrastive concept vector extraction · method · 0.750
Method for obtaining concept vectors by subtracting activations from two contrasting prompts.
What dimensions of persona are not captured by our extracted role vectors, and how complete is the current persona space mapping? · question · 0.733
Limitation question motivating future work on persona elicitation strategies
Value extraction · concept · 0.727
The initial stage of uncertainty metabolization, pulling usable value from sensations.
Single-prompt concept vector extraction · method · 0.726
Method using activations from the prompt 'Tell me about {word}' minus mean over other random words to obtain concept vectors.
Function Vector · concept · 0.709
Type of steering vector enabling zero-shot task execution, cited from Todd et al. 2024
Composite Steering Vector (v_role-truth) · concept · 0.708
Steering vector extracted in Experiment 2 capturing latent representation of desired role behavior and honesty semantics
Reflection direction extraction · method · 0.705
Computes reflection direction as mean difference between MLP and attention output representations of first tokens in reflection vs. non-reflection steps
Persona Vectors (Chen et al.) · framework · 0.700
Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles