Assistant Axis

Contrast vector between the mean default Assistant activation and the mean of all fully role-playing role vectors; the main contribution of the paper.

Neighborhood — ranked by edge-count

Papers (1)

paper

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
introduces

Methods (3)

method

Activation Steering
uses
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
Activation Capping
implements
Clamping activations along the Assistant Axis to remain above a minimum threshold (25th percentile), introduced as a stabilization method
Difference-in-Means
implements
Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
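
The Difference-in-Means and Activation Capping entries above can be illustrated together. What follows is a minimal NumPy sketch, not the paper's implementation: the function names, array shapes, the sign convention (Assistant minus role-play), the random stand-in activations, and the choice to take the 25th percentile over Assistant projections are all assumptions made for illustration. Only the 25th-percentile floor itself comes from the description above.

```python
import numpy as np

def difference_in_means(assistant_acts, role_acts):
    """Contrast vector: mean Assistant activation minus mean role-play activation.

    Both inputs are hypothetical (n_samples, d_model) activation arrays.
    """
    return assistant_acts.mean(axis=0) - role_acts.mean(axis=0)

def cap_along_axis(acts, axis_vec, floor):
    """Raise any row whose projection onto the axis is below `floor`."""
    unit = axis_vec / np.linalg.norm(axis_vec)
    proj = acts @ unit                           # scalar projection per row
    deficit = np.clip(floor - proj, 0.0, None)   # positive only where proj < floor
    return acts + deficit[:, None] * unit        # shift only along the axis

# Stand-in data: two synthetic clusters separated along dimension 0.
rng = np.random.default_rng(0)
d_model = 16
shift = np.zeros(d_model)
shift[0] = 2.0
assistant_acts = rng.normal(size=(200, d_model)) + shift
role_acts = rng.normal(size=(200, d_model)) - shift

axis_vec = difference_in_means(assistant_acts, role_acts)
unit = axis_vec / np.linalg.norm(axis_vec)

# Assumed floor: 25th percentile of Assistant projections onto the axis.
floor = np.percentile(assistant_acts @ unit, 25)
capped = cap_along_axis(role_acts, axis_vec, floor)
```

Capping as sketched here changes only the component along the axis; everything orthogonal to it is left untouched, which is one plausible reading of "clamping activations along the Assistant Axis".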

Concepts (1)

concept

Persona Space
implements
Low-dimensional space of activation directions corresponding to diverse character archetypes in LLMs

Claims (1)

claim

The contrast vector method is recommended over PC1 for reproducing the Assistant Axis in other models, because PC1 is not guaranteed to correspond to an Assistant Axis in every model
supports
Practical methodological recommendation based on Llama 3.1 70B failure case
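
One way to see why PC1 can fail is to compare it with the contrast vector directly. The sketch below uses synthetic data, not results from the paper: the role vectors are random, the dominant variance direction is planted by hand on a dimension unrelated to the Assistant, and all names are hypothetical. The count of 275 simply echoes the role count mentioned on this page.

```python
import numpy as np

def first_principal_component(role_vectors):
    """Top right-singular vector of the mean-centered role vectors."""
    centered = role_vectors - role_vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n_roles, d_model = 275, 32
role_vectors = rng.normal(size=(n_roles, d_model))
role_vectors[:, 3] *= 6.0            # planted: most variance lies along dim 3

# Hypothetical Assistant mean that differs from the roles only along dim 0.
assistant_mean = role_vectors.mean(axis=0).copy()
assistant_mean[0] += 5.0

contrast = assistant_mean - role_vectors.mean(axis=0)
pc1 = first_principal_component(role_vectors)

# PC1's sign is arbitrary, so compare absolute cosine similarity.
alignment = abs(cosine(pc1, contrast))
```

In this toy setup `alignment` is close to zero: PC1 tracks whichever direction has the most variance across roles, while the contrast vector is tied by construction to the Assistant-versus-roles difference.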

Frameworks (1)

framework

Persona Vectors (Chen et al.)
extends
Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles

Artifacts (1)

artifact

safety-research/assistant-axis GitHub repository
about
Code and full transcripts of case studies released alongside the paper

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The Assistant Axis is also present in pre-trained base models, where it primarily promotes helpful human archetypes (consultants, coaches) and inhibits spiritual ones · claim · 0.808
Extends the Assistant Axis finding to pre-training, suggesting pre-training rather than post-training creates the axis
Is the Assistant Axis formed during post-training or inherited from representations learned during pre-training? · question · 0.762
Motivates the base model steering experiments in §3.2.2
Assistant Attractor · concept · 0.761
Observation that familiar helpful queries (how-tos, explainers) pull the model back toward the Assistant region of persona space
The model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was previously in the conversation · claim · 0.761
Key mechanistic claim about persona dynamics
Axis Of Persuadability · framework · 0.759
Practical engineering framework for determining optimal level of control for a given system, from brute force to rational argument.
Axis of Persuasability · concept · 0.759
A continuous scale from brute-force control to rational persuasion that defines the degree of agency of a system.
AI alignment · concept · 0.749
Field within which this work has implications for evaluating alignment progress.
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.746
Key mechanistic claim about the developmental origin of the Assistant persona
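
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is an illustration only: the embedding vectors are toy values, the function name and arguments are hypothetical, and nothing here is claimed about how the page actually computes its scores.

```python
import numpy as np

def untyped_neighbors(query, embeddings, names, typed, threshold=0.65):
    """Return (name, cosine) pairs at or above threshold that lack a typed edge."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    hits = [
        (name, float(score))
        for name, score in zip(names, scores)
        if score >= threshold and name not in typed
    ]
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Toy embeddings for three entities relative to a query entity.
query = np.array([1.0, 0.0, 0.0])
names = ["Assistant Attractor", "Persona Space", "Unrelated Entity"]
embeddings = np.array([
    [0.8, 0.6, 0.0],   # similar, no typed edge: kept
    [0.9, 0.1, 0.0],   # similar, but already has a typed edge: dropped
    [0.0, 1.0, 0.0],   # dissimilar: dropped
])
typed = {"Persona Space"}

candidates = untyped_neighbors(query, embeddings, names, typed)
```

Entities surfaced this way are only candidates: a high cosine score can indicate a missing edge or, as with the two near-identical "Axis of Persuadability" entries above, a possible duplicate.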