framework
active
framework:assistant-axis · Assistant Axis
Contrast vector between mean default Assistant activation and mean of all fully role-playing role vectors; main contribution of the paper
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (3)
method
- Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
- Activation Capping · implements · Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
- Difference-in-Means · implements · Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
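
The Difference-in-Means and Activation Capping entries above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names (`difference_in_means`, `project`, `cap_activations`), the synthetic data, and the choice to take the 25th percentile over projections of the Assistant activations are all assumptions made for illustration.

```python
import numpy as np

def difference_in_means(assistant_acts: np.ndarray, role_vectors: np.ndarray) -> np.ndarray:
    """Contrast vector: mean Assistant activation minus mean role vector."""
    return assistant_acts.mean(axis=0) - role_vectors.mean(axis=0)

def project(acts: np.ndarray, axis: np.ndarray) -> np.ndarray:
    """Scalar projection of each row of `acts` onto the unit-normalized axis."""
    unit = axis / np.linalg.norm(axis)
    return acts @ unit

def cap_activations(acts: np.ndarray, axis: np.ndarray, floor: float) -> np.ndarray:
    """Raise any projection below `floor` up to `floor`; other rows are unchanged."""
    unit = axis / np.linalg.norm(axis)
    deficit = np.maximum(floor - acts @ unit, 0.0)  # zero where already above the floor
    return acts + deficit[:, None] * unit           # shift only along the axis

# Synthetic stand-in data (hypothetical, not from the paper).
rng = np.random.default_rng(0)
assistant_acts = rng.normal(loc=1.0, size=(64, 16))  # rows: default Assistant activations
role_vectors = rng.normal(loc=0.0, size=(32, 16))    # rows: role-play role vectors

axis = difference_in_means(assistant_acts, role_vectors)
# Assumption: the threshold is the 25th percentile of Assistant projections.
floor = np.percentile(project(assistant_acts, axis), 25)
capped = cap_activations(role_vectors, axis, floor)
```

Because the correction is added only along the unit axis, components orthogonal to the axis are left untouched, which is one plausible reading of "clamping along the Assistant Axis".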
Concepts (1)
concept
- Persona Space · implements · Low-dimensional space of activation directions corresponding to diverse character archetypes in LLMs
Claims (1)
claim
- Practical methodological recommendation based on Llama 3.1 70B failure case
Frameworks (1)
framework
- Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles
Artifacts (1)
artifact
- Code and full transcripts of case studies released alongside the paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
- Is the Assistant Axis formed during post-training or inherited from representations learned during pre-training? · question · 0.762 · Motivates the base model steering experiments in §3.2.2
- Observation that familiar helpful queries (how-tos, explainers) pull the model back toward the Assistant region of persona space
- Key mechanistic claim about persona dynamics
- Practical engineering framework for determining optimal level of control for a given system, from brute force to rational argument.
- A continuous scale from brute-force control to rational persuasion that defines the degree of agency of a system.
- Field within which this work has implications for evaluating alignment progress.
- Key mechanistic claim about the developmental origin of the Assistant persona