claim

active

Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment—a quantitative signal for when models are drifting from their intended identity

Proposed future application of the Assistant Axis
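
As a rough illustration of what such a signal might look like, the sketch below projects per-turn activation vectors onto a fixed axis and flags turns that fall well below a baseline. Everything in it is an assumption made for illustration: the toy three-dimensional vectors, the names `project_onto_axis` and `drift_flags`, the baseline and threshold values, and the sign convention that a larger projection means "more Assistant-like". The source record does not specify how the axis is extracted or how a monitoring threshold would be chosen.

```python
import math
from typing import List, Sequence


def project_onto_axis(activation: Sequence[float], axis: Sequence[float]) -> float:
    """Scalar projection of an activation vector onto the unit-normalized axis."""
    if len(activation) != len(axis):
        raise ValueError("activation and axis must have the same dimension")
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return sum(x * a for x, a in zip(activation, axis)) / norm


def drift_flags(projections: Sequence[float], baseline: float, threshold: float) -> List[bool]:
    """Flag each turn whose projection is more than `threshold` below `baseline`."""
    return [(baseline - p) > threshold for p in projections]


# Hypothetical toy data: one activation vector per conversation turn.
axis = [1.0, 0.0, 0.0]            # stand-in for a learned axis direction
turns = [
    [2.0, 0.5, -0.1],
    [1.8, 0.2, 0.3],
    [0.4, 1.5, 0.9],              # projection drops sharply on this turn
]

projections = [project_onto_axis(t, axis) for t in turns]
flags = drift_flags(projections, baseline=2.0, threshold=1.0)

for i, (p, flagged) in enumerate(zip(projections, flags), start=1):
    print(f"turn {i}: projection={p:.2f} drift={flagged}")
```

In a real deployment the activations would come from a chosen layer of the model and the baseline from a calibration set; both choices are left open here.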

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was previously in the conversation · claim · 0.798
Key mechanistic claim about persona dynamics
First-turn Assistant Axis projection has moderate correlation (r=0.39-0.52, p<0.001) with rate of second-turn harmful responses across 275 roles in Qwen 3 32B · finding · 0.795
Shows that deviation from Assistant persona predicts downstream harmful behavior
The Assistant Axis in instruct models mainly inherits from pre-existing helpful and harmless human personas in base models, later acquiring additional associations (such as being an AI) during post-training · claim · 0.785
Key mechanistic claim about the developmental origin of the Assistant persona
We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.785
Core predictive hypothesis linking activation representations to behavioral outcomes
User message embeddings predict subsequent model Assistant Axis projection with R²=0.53-0.77 (p<0.001) but predict delta from previous response with only R²=0.10 · finding · 0.784
Shows model persona position is primarily determined by the most recent user message, not prior drift
Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers · finding · 0.779
Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
Steering base models toward the Assistant Axis increases agreeableness traits (friendly, kind, helpful) and decreases extraversion in Gemma and openness in Llama · finding · 0.777
Characterizes the trait content of the Assistant Axis in pre-trained models
Configuration coherence can be measured by counting locally symmetric sub-configurations; this measure shows strong agreement with cognition and perception experiments. · finding · 0.771
Quantifiable measure linking structural properties of configurations to human perception, supporting the mathematical reality of wholeness.