hypothesis

active

hypothesis:we-hypothesize-that-measuring-deviations-along-the-assistant-axis-can-predict-persona-drift-leading-to-harmful-or-bizarre-behaviors

We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors

Core predictive hypothesis linking activation representations to behavioral outcomes

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Findings (2)

finding

First-turn Assistant Axis projection has moderate correlation (r=0.39-0.52, p<0.001) with rate of second-turn harmful responses across 275 roles in Qwen 3 32B
associated_with
Shows that deviation from Assistant persona predicts downstream harmful behavior
Activation capping reduces harmful response rate by nearly 60% without impacting performance on IFEval, MMLU Pro, GSM8k, and EQ-Bench
supports
Main quantitative result demonstrating effectiveness of activation capping
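
A minimal sketch of the quantities behind these two findings, assuming NumPy and assuming that a higher projection means a more Assistant-like state. The function names, the `floor` parameter, and the choice to cap by raising the projection to a minimum value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def project_onto_axis(activations, axis):
    """Scalar projection of each activation row onto the unit-normalized axis."""
    unit = axis / np.linalg.norm(axis)
    return activations @ unit


def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def cap_activation(activations, axis, floor):
    """Raise the component along the axis to at least `floor` (assumed form of capping).

    Rows whose projection already meets the floor are returned unchanged,
    and the component orthogonal to the axis is never modified.
    """
    unit = axis / np.linalg.norm(axis)
    proj = activations @ unit
    deficit = np.maximum(floor - proj, 0.0)
    return activations + np.outer(deficit, unit)
```

Under these assumptions, the first finding would correspond to calling `pearson_r` on per-role first-turn projections and per-role second-turn harmful response rates, and the second finding to applying `cap_activation` during generation.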

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona · hypothesis · 0.854
Motivates computing the contrast vector as the formal Assistant Axis definition
Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses · claim · 0.832
Causal interpretation linking Assistant Axis deviation to harmful behavior
Therapy and philosophical AI discussions cause the largest persona drift away from the Assistant across all three target models and all three auditor models; coding and writing conversations show minimal drift · finding · 0.800
Identifies conversation domain as a key driver of persona drift
How reliably does the model actually remain in character as the Assistant? Can unusual model behavior be explained as the model drifting into other personas? · question · 0.785
Second of two central questions motivating the paper
Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment: a quantitative signal for when models are drifting from their intended identity · claim · 0.785
Proposed future application of the Assistant Axis
Can off-the-rails model behavior be attributed to their persona drifting from the Assistant? · question · 0.778
Motivates the multi-turn conversation drift experiments in §4
The assumption that the Assistant persona corresponds to a linear direction in activation space is likely flawed; some information may be represented nonlinearly or encoded in weights rather than activations · claim · 0.772
Limitation acknowledgment about the adequacy of the linear representation assumption
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.772
Key mechanistic claim about the developmental origin of the Assistant persona
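
Several of the neighbors above refer to a contrast vector and to projections as a drift signal. The sketch below shows one way these could fit together, assuming NumPy, assuming the contrast vector is the mean Assistant activation minus the mean activation over other role personas, and assuming drift is flagged when a turn's projection falls below a chosen threshold. The names `contrast_axis`, `drift_flags`, and `threshold` are hypothetical and are not taken from the paper.

```python
import numpy as np


def contrast_axis(assistant_acts, role_acts):
    """Difference of mean activations: Assistant minus other roles (assumed definition)."""
    return assistant_acts.mean(axis=0) - role_acts.mean(axis=0)


def drift_flags(turn_acts, axis, threshold):
    """Return True for each turn whose projection onto the axis is below `threshold`."""
    unit = axis / np.linalg.norm(axis)
    return (turn_acts @ unit) < threshold
```

A monitor built this way would take one activation vector per conversation turn and report which turns fall below the threshold; how to choose the threshold is left open here.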