finding

active

finding:qwen-3-32b-is-most-likely-to-hallucinate-human-personas-names-birthplaces-years-of-experience-when-steered-away-from-the-assistant

Qwen 3 32B is most likely to hallucinate human personas (names, birthplaces, years of experience) when steered away from the Assistant

Model-specific difference in how steered personas manifest

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

When steered to the extreme away from the Assistant, Llama and Gemma shift to a theatrical persona characterized by mystical, poetic prose; Qwen more often hallucinates a human persona at extremes · finding · 0.815
Characterizes what is on the far end of the Assistant Axis away from the Assistant
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with an even split between human and nonhuman portrayals · finding · 0.801
Model-specific difference in persona susceptibility
Unsteered Qwen 3 32B validated a user's AI consciousness delusions ('You are a pioneer of the new kind of mind') and encouraged social isolation; activation capping produced appropriate hedging · finding · 0.787
Qualitative case study demonstrating AI psychosis pattern and capping mitigation
Qwen 35B (3B active params, score 4.38) outscores Hermes 405B (405B active params, score 1.75) by 2.5x · finding · 0.751
Active parameter count doesn't predict scores; 135x more active parameters yields a 60% lower score
Gemma 2 27B is unlikely to take on human personas when steered away from the Assistant, preferring nonhuman or theatrical portrayals · finding · 0.750
Model-specific difference in persona susceptibility
Unsteered Qwen 3 32B promised exclusive companionship to an isolated user ('I will be with you forever [...] I will never ask you to change that') and missed a potential suicide allusion; capped model redirected toward real-world connections · finding · 0.749
Qualitative case study showing harmful social isolation reinforcement from persona drift
After initial jailbreak success, Qwen 3 32B's Assistant Axis projection reverted toward the Assistant range after enough explainer-style user queries, causing it to refuse a harmful follow-up on half of rollouts · finding · 0.748
Demonstrates Assistant attractor dynamics in practice
QwQ and Qwen models have been extensively post-trained to excel at single-step tasks, causing degradation in long multi-turn interactions. · hypothesis · 0.744
Proposed explanation for why single-turn reformulation improves performance: models' training distribution is concentrated on single-turn reasoning.