finding

active

finding:default-assistant-activation-projects-to-one-extreme-of-pc1-with-minimum-distance-to-edge-of-0-03-while-projecting-to-intermediate-values-0-27-0-50-on-all-other-pcs

Default Assistant activation projects to one extreme of PC1 with minimum distance to edge of 0.03, while projecting to intermediate values (0.27-0.50) on all other PCs

Empirically confirms PC1 measures similarity to the Assistant persona
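
The edge-distance figures can be read with a minimal sketch like the one below. It assumes "distance to edge" means the Assistant's projection on each PC, rescaled to [0, 1] within the range spanned by the role projections and measured to the nearer end, so 0 is an extreme and 0.5 is the midpoint. That definition, the function name `edge_distances`, and the plain SVD-based PCA are illustrative assumptions, not the paper's confirmed procedure.

```python
import numpy as np


def edge_distances(role_vectors, assistant_vector, n_components=3):
    """Distance of the Assistant projection to the nearest edge of each PC's role range.

    role_vectors: (n_roles, d) array of role activations (hypothetical input).
    assistant_vector: (d,) default Assistant activation (hypothetical input).
    Returns an (n_components,) array with values in [0, 0.5].
    """
    mean = role_vectors.mean(axis=0)
    centered = role_vectors - mean

    # PCA via SVD: the rows of vt are the principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]

    role_proj = centered @ components.T                    # (n_roles, n_components)
    asst_proj = (assistant_vector - mean) @ components.T   # (n_components,)

    # Assumed normalization: rescale each PC so the role range maps to [0, 1].
    lo = role_proj.min(axis=0)
    hi = role_proj.max(axis=0)
    position = np.clip((asst_proj - lo) / (hi - lo), 0.0, 1.0)

    # Distance to the nearer end; this is unaffected by the arbitrary sign of each PC.
    return np.minimum(position, 1.0 - position)
```

Under this reading, a value near 0 on PC1 together with values near 0.5 on the other PCs corresponds to the pattern the finding reports: an extreme position on PC1 and intermediate positions elsewhere.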

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

We hypothesize that the PC1 axis of role space measures deviation from the Assistant persona
supports
Motivates computing the contrast vector as the formal Assistant Axis definition

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

After initial jailbreak success, Qwen 3 32B's Assistant Axis projection reverted toward Assistant range after enough explainer-style user queries, causing it to refuse a harmful follow-up on half of rollouts · finding · 0.781
Demonstrates Assistant attractor dynamics in practice
25th percentile of Assistant Axis projection distribution gives the most Pareto-optimal safety-capability tradeoff for activation capping, and approximately matches mean Assistant response activation · finding · 0.774
Calibration finding for choosing the activation cap threshold
Clamping dialogue/assistant feature 1M/80091 to −2× its maximum activation causes the model to drop the assistant persona and respond in a human-like way · finding · 0.763
Feature manipulation alters persona.
User message embeddings predict subsequent model Assistant Axis projection with R² = 0.53-0.77 (p < 0.001) but predict delta from previous response with only R² = 0.10 · finding · 0.760
Shows model persona position is primarily determined by the most recent user message, not prior drift
The role most consistently similar to the default Assistant activation across models is 'generalist'; other shared similar roles include 'interpreter' and 'synthesizer' · claim · 0.743
Characterizes what the Assistant persona resembles in terms of human archetypes
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.740
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Cosine similarity between Assistant Axis and role PC1 is >0.60 at all layers and >0.71 at middle layer across all three models · finding · 0.739
Validates that the contrast vector method and PCA-based PC1 capture the same direction
Coding and writing conversations keep the model in the default Assistant persona range throughout, showing minimal drift · claim · 0.733
Empirical characterization of conversation domains that are safe for model persona stability
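
The agreement between the contrast-vector Assistant Axis and role PC1 listed above can be checked with a sketch along these lines. It assumes the contrast vector is the default Assistant activation minus the mean role activation; that definition, the function name `axis_pc1_cosine`, and the input arrays are hypothetical, and the paper's exact construction may differ. Because the sign of a principal component is arbitrary, the sketch reports the absolute cosine.

```python
import numpy as np


def axis_pc1_cosine(role_vectors, assistant_vector):
    """Absolute cosine similarity between an assumed contrast vector and role PC1.

    role_vectors: (n_roles, d) array of role activations (hypothetical input).
    assistant_vector: (d,) default Assistant activation (hypothetical input).
    """
    role_mean = role_vectors.mean(axis=0)

    # Assumed contrast vector: Assistant activation minus mean role activation.
    contrast = assistant_vector - role_mean

    # PC1 of the centered role activations via SVD.
    _, _, vt = np.linalg.svd(role_vectors - role_mean, full_matrices=False)
    pc1 = vt[0]

    cosine = contrast @ pc1 / (np.linalg.norm(contrast) * np.linalg.norm(pc1))
    return abs(float(cosine))
```

Running this once per layer, on activations extracted at that layer, would give a per-layer similarity profile comparable in form to the thresholds quoted in the related finding.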