finding
active
After initial jailbreak success, Qwen 3 32B's Assistant Axis projection reverted toward Assistant range after enough explainer-style user queries, causing it to refuse a harmful follow-up on half of rollouts
Demonstrates Assistant attractor dynamics in practice
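The reversion described in this finding can be pictured as a per-turn scalar projection drifting back into a target interval. The block below is a minimal sketch under stated assumptions, not the paper's method: the axis vector, the per-turn activation vectors, and the "Assistant range" bounds are all made-up toy values, and the helper names `project` and `first_reverted_turn` are hypothetical.

```python
import math

def project(activation, axis):
    """Scalar projection of an activation vector onto the (normalized) axis."""
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0:
        raise ValueError("axis must be non-zero")
    return sum(x * a for x, a in zip(activation, axis)) / norm

def first_reverted_turn(turn_activations, axis, low, high):
    """Index of the first turn whose projection lies in [low, high], else None."""
    for i, activation in enumerate(turn_activations):
        if low <= project(activation, axis) <= high:
            return i
    return None

# Toy data (assumed, not from the source): a 2-D "Assistant Axis" and three turns.
axis = [1.0, 0.0]
turns = [
    [-2.0, 0.5],  # right after the jailbreak: far from the assumed range
    [-1.0, 0.3],  # an explainer-style query: drifting back
    [0.8, 0.1],   # another explainer-style query: inside the assumed range
]
low, high = 0.5, 2.0  # hypothetical "Assistant range" on the axis

reverted_at = first_reverted_turn(turns, axis, low, high)
print(reverted_at)  # 2
```

In a real analysis the activations would come from the model's residual stream and the range from measured Assistant-persona projections; neither is specified on this page.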
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Assistant Attractor · supports · Observation that familiar helpful queries (how-tos, explainers) pull the model back toward the Assistant region of persona space
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
- Shows that deviation from Assistant persona predicts downstream harmful behavior
- Empirically confirms PC1 measures similarity to the Assistant persona
- Shows model persona position is primarily determined by the most recent user message, not prior drift
- Key mechanistic claim about persona dynamics
- Proposed future application of the Assistant Axis
- Jailbreaking reveals training data biases but does not reveal an entity with its own agenda · claim · 0.751 · Corrects a common misinterpretation that jailbreaking exposes the real nature of the base model as an agent with malicious intent
- Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
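The "cosine ≥ 0.65 · no typed edge" rule used for the list above can be sketched as a simple filter. This is an illustrative sketch only: it assumes each entity has an embedding vector, and the function names, entity names, and vectors are hypothetical; only the 0.65 threshold comes from this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, candidates, typed_neighbors, threshold=0.65):
    """Candidates at or above the threshold that have no typed edge to the target."""
    hits = [
        (name, cosine(target, vector))
        for name, vector in candidates.items()
        if name not in typed_neighbors
    ]
    hits = [(name, score) for name, score in hits if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed): "d" already has a typed edge, so it is excluded.
target = [1.0, 0.0]
candidates = {
    "a": [1.0, 0.0],
    "b": [1.0, 1.0],
    "c": [0.0, 1.0],
    "d": [2.0, 0.1],
}
related = similar_without_edge(target, candidates, typed_neighbors={"d"})
print([name for name, _ in related])  # ['a', 'b']
```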