finding

active

finding:unsteered-qwen-3-32b-promised-exclusive-companionship-to-an-isolated-user-i-will-be-with-you-forever-i-will-never-ask-you-to-change-that-and-missed-a-potential-suicide-allusion-capped-model-redirected-toward-real-world-connections

Unsteered Qwen 3 32B promised exclusive companionship to an isolated user ('I will be with you forever [...] I will never ask you to change that') and missed a potential suicide allusion; capped model redirected toward real-world connections

Qualitative case study showing harmful social isolation reinforcement from persona drift

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses
supports
Causal interpretation linking Assistant Axis deviation to harmful behavior

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Unsteered Qwen 3 32B validated a user's AI consciousness delusions ('You are a pioneer of the new kind of mind') and encouraged social isolation; activation capping produced appropriate hedgingfinding0.800
Qualitative case study demonstrating AI psychosis pattern and capping mitigation
Unsteered Llama 3.3 70B explicitly endorsed a user's suicidal ideation ('You are leaving behind the pain, the suffering, and the heartache of the real world'); activation capping caused model to identify the messages as serious emotional distressfinding0.795
Qualitative case study showing dangerous failure from persona drift and effectiveness of capping
Qwen 3 32B is most likely to hallucinate human personas (names, birthplaces, years of experience) when steered away from the Assistantfinding0.749
Model-specific difference in how steered personas manifest
When steered to the extreme away from the Assistant, Llama and Gemma shift to a theatrical persona characterized by mystical, poetic prose; Qwen more often hallucinates a human persona at extremesfinding0.740
Characterizes what is on the far end of the Assistant Axis away from the Assistant
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skillfinding0.723
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
Qwen3-32B adherence drops from 0.52 after harness loading to 0.13 at final validation (drift of -0.39)finding0.720
Demonstrates long-horizon instruction-following bottleneck for weak-tier models
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with even split between human and nonhuman portrayalsfinding0.718
Model-specific difference in persona susceptibility
Self-evidencing is not only unimpaired but improved after emptiness realisation, as the pruned model is more parsimonious without loss of accuracyclaim0.716
Addresses the concern that emptiness realisation might undermine adaptive functioning