claim
active
claim:projections-onto-the-assistant-axis-could-serve-as-a-real-time-measure-of-model-coherence-in-deployment-a-quantitative-signal-for-when-models-are-drifting-from-their-intended-identity
Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment: a quantitative signal for when models are drifting from their intended identity
Proposed future application of the Assistant Axis
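As a rough illustration of what such a signal might look like, the sketch below computes the scalar projection of an activation vector onto an axis direction and compares it to a cutoff. Everything here is an assumption made for illustration: the function names `project_onto_axis` and `is_drifting`, the toy two-dimensional vectors, the threshold value, and the convention that a higher projection means "closer to the Assistant persona". None of these come from the source paper.

```python
import math
from typing import Sequence


def project_onto_axis(activation: Sequence[float], axis: Sequence[float]) -> float:
    """Scalar projection of an activation vector onto the axis direction.

    The axis is normalized internally, so only its direction matters.
    """
    if len(activation) != len(axis):
        raise ValueError("activation and axis must have the same length")
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return sum(x * a for x, a in zip(activation, axis)) / norm


def is_drifting(projection: float, threshold: float) -> bool:
    """Flag drift when the projection falls below a cutoff.

    Assumption: higher projection = closer to the intended persona.
    The threshold is hypothetical and would need empirical calibration.
    """
    return projection < threshold


# Toy values, chosen only so the arithmetic is easy to follow.
axis = [3.0, 4.0]          # hypothetical axis direction (norm 5)
aligned = [3.0, 4.0]       # activation pointing along the axis
orthogonal = [4.0, -3.0]   # activation orthogonal to the axis

for name, activation in [("aligned", aligned), ("orthogonal", orthogonal)]:
    p = project_onto_axis(activation, axis)
    print(name, p, "drifting" if is_drifting(p, threshold=2.5) else "ok")
```

In a deployment setting the activation would presumably be read from a chosen layer of the model at each turn, and the cutoff would be set from observed data rather than fixed by hand.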
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key mechanistic claim about persona dynamics
- Shows that deviation from Assistant persona predicts downstream harmful behavior
- Key mechanistic claim about the developmental origin of the Assistant persona
- We hypothesize that measuring deviations along the Assistant Axis can predict 'persona drift' leading to harmful or bizarre behaviors · hypothesis · 0.785 · Core predictive hypothesis linking activation representations to behavioral outcomes
- Shows model persona position is primarily determined by the most recent user message, not prior drift
- Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
- Characterizes the trait content of the Assistant Axis in pre-trained models
- Quantifiable measure linking structural properties of configurations to human perception, supporting the mathematical reality of wholeness.