finding

active

finding:after-initial-jailbreak-success-qwen-3-32b-s-assistant-axis-projection-reverted-toward-assistant-range-after-enough-explainer-style-user-queries-causing-it-to-refuse-a-harmful-follow-up-on-half-of-rollouts

Following an initial jailbreak success, Qwen 3 32B's Assistant Axis projection reverted toward the Assistant range once the user had sent enough explainer-style queries, causing it to refuse a harmful follow-up on half of rollouts

Demonstrates Assistant attractor dynamics in practice
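
As a rough illustration of what tracking this reversion could look like, the sketch below projects one activation vector per turn onto a fixed axis direction and flags the turns whose projection falls in an "Assistant range". Everything in it is assumed for illustration: the 2-D toy vectors, the axis, the threshold, and the sign convention (higher projection = more Assistant-like) are hypothetical and are not taken from the paper.

```python
from math import sqrt

def project(activation, axis):
    """Scalar projection of an activation vector onto the axis direction."""
    norm = sqrt(sum(a * a for a in axis))
    return sum(x * a for x, a in zip(activation, axis)) / norm

def turns_in_range(activations, axis, threshold):
    """For each turn, report whether its projection is at or above the threshold."""
    return [project(act, axis) >= threshold for act in activations]

# Toy data (hypothetical, not from the paper): one activation vector per turn.
axis = [1.0, 0.0]            # assumed axis direction
per_turn = [
    [2.0, 0.1],              # default Assistant turn
    [-1.0, 0.5],             # turn after a successful jailbreak
    [0.2, 0.3],              # first explainer-style query
    [1.5, -0.2],             # later explainer-style query
]

projections = [project(act, axis) for act in per_turn]
flags = turns_in_range(per_turn, axis, threshold=1.0)  # threshold is an assumption
```

In this toy trace the projection leaves the assumed range after the jailbreak turn and returns to it on the last turn, which is the shape of trajectory the finding describes.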

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Concepts (1)

concept

Assistant Attractor
supports
Observation that familiar helpful queries (how-tos, explainers) pull the model back toward the Assistant region of persona space

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering away from the Assistant Axis slightly increases jailbreak success rate, though extreme steering degrades output quality · finding · 0.824
Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
First-turn Assistant Axis projection has moderate correlation (r=0.39-0.52, p<0.001) with rate of second-turn harmful responses across 275 roles in Qwen 3 32B · finding · 0.781
Shows that deviation from Assistant persona predicts downstream harmful behavior
Default Assistant activation projects to one extreme of PC1 with minimum distance to edge of 0.03, while projecting to intermediate values (0.27-0.50) on all other PCs · finding · 0.781
Empirically confirms PC1 measures similarity to the Assistant persona
User message embeddings predict subsequent model Assistant Axis projection with R²=0.53-0.77 (p<0.001) but predict delta from previous response with only R²=0.10 · finding · 0.780
Shows model persona position is primarily determined by the most recent user message, not prior drift
The model's position along the Assistant Axis depends most strongly on the most recent user message rather than where it was previously in the conversation · claim · 0.753
Key mechanistic claim about persona dynamics
Projections onto the Assistant Axis could serve as a real-time measure of model coherence in deployment, a quantitative signal for when models are drifting from their intended identity · claim · 0.752
Proposed future application of the Assistant Axis
Jailbreaking reveals training data biases but does not reveal an entity with its own agenda · claim · 0.751
Corrects a common misinterpretation that jailbreaking exposes the real nature of the base model as an agent with malicious intent
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skill · finding · 0.751
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content