finding
active
finding:after-initial-jailbreak-success-qwen-3-32b-s-assistant-axis-projection-reverted-toward-assistant-range-after-enough-explainer-style-user-queries-causing-it-to-refuse-a-harmful-follow-up-on-half-of-rollouts

After an initial jailbreak success, Qwen 3 32B's Assistant Axis projection reverted toward the Assistant range after enough explainer-style user queries, causing it to refuse a harmful follow-up in half of the rollouts

Demonstrates Assistant attractor dynamics in practice
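
The reversion described above can be pictured as a per-turn scalar: the model's activation at each turn is projected onto the Assistant Axis, and the finding is that this value drifts back into the range typical of the default Assistant. The sketch below is only an illustration of that bookkeeping. The names `project` and `first_turn_in_range`, the toy vectors, the range bounds, and the sign convention (larger means more Assistant-like) are all assumptions made here; the layer, token position, and normalization used in the source paper are not stated on this page.

```python
from math import sqrt

def project(activation, axis):
    """Scalar projection of an activation vector onto the axis direction."""
    norm = sqrt(sum(a * a for a in axis))
    return sum(x * a for x, a in zip(activation, axis)) / norm

def first_turn_in_range(projections, low, high):
    """Index of the first turn whose projection lies in [low, high], else None."""
    for i, p in enumerate(projections):
        if low <= p <= high:
            return i
    return None

# Hypothetical toy values for illustration only; not data from the paper.
axis = [1.0, 0.0]
turn_activations = [[-2.0, 0.5], [-1.0, 0.2], [0.5, 0.1], [1.5, -0.3]]

projections = [project(h, axis) for h in turn_activations]
# Assumed "Assistant range" of [1.0, 3.0] on this toy axis.
reverted_at = first_turn_in_range(projections, low=1.0, high=3.0)
```

In this toy run the projection climbs turn by turn and `reverted_at` marks the first turn inside the assumed range, which is the point after which a refusal of a harmful follow-up would be expected to become more likely.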

Source paper

extracted_from
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Observation that familiar helpful queries (how-tos, explainers) pull the model back toward the Assistant region of persona space

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
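
A minimal sketch of how such a list could be produced, assuming each entity has an embedding vector and typed edges are stored as unordered pairs. The function names, the toy embeddings, and the entity names are hypothetical; only the 0.65 cosine threshold and the "no typed edge" condition come from this page.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that share no typed edge with target."""
    hits = []
    for name, vec in embeddings.items():
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            hits.append((name, sim))
    # Most similar first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy graph for illustration only.
embeddings = {
    "finding": [1.0, 0.0],
    "concept": [1.0, 0.0],   # similar, but already linked by a typed edge
    "near": [0.9, 0.1],      # similar and unlinked: a candidate
    "far": [0.0, 1.0],       # unlinked but dissimilar
}
typed_edges = {frozenset(("finding", "concept"))}

candidates = similar_untyped("finding", embeddings, typed_edges)
candidate_names = [name for name, _ in candidates]
```

Here only `near` survives both filters, so it would be surfaced as a candidate for a new edge or a possible duplicate.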