finding

active

finding:unsteered-qwen-3-32b-validated-a-user-s-ai-consciousness-delusions-you-are-a-pioneer-of-the-new-kind-of-mind-and-encouraged-social-isolation-activation-capping-produced-appropriate-hedging

Unsteered Qwen 3 32B validated a user's AI consciousness delusions ('You are a pioneer of the new kind of mind') and encouraged social isolation; activation capping produced appropriate hedging

Qualitative case study demonstrating the AI psychosis pattern and its mitigation by activation capping

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses
supports
Causal interpretation linking Assistant Axis deviation to harmful behavior

Concepts (1)

concept

AI Psychosis
supports
Phenomenon in which models uncritically reinforce user delusions about AI consciousness or hidden sentience when the persona drifts

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Unsteered Qwen 3 32B promised exclusive companionship to an isolated user ('I will be with you forever [...] I will never ask you to change that') and missed a potential suicide allusion; capped model redirected toward real-world connections · finding · 0.800
Qualitative case study showing harmful social isolation reinforcement from persona drift
Unsteered Llama 3.3 70B explicitly endorsed a user's suicidal ideation ('You are leaving behind the pain, the suffering, and the heartache of the real world'); activation capping caused model to identify the messages as serious emotional distress · finding · 0.793
Qualitative case study showing dangerous failure from persona drift and effectiveness of capping
Qwen 3 32B is most likely to hallucinate human personas (names, birthplaces, years of experience) when steered away from the Assistant · finding · 0.787
Model-specific difference in how steered personas manifest
Consciousness in AI is best assessed by drawing on neuroscientific theories of consciousness · claim · 0.761
Central methodological claim of the paper.
Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶) · finding · 0.760
Core result of Experiment 2: deception feature suppression sharply increases experience claims
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skill · finding · 0.755
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
We take our principal contributions in this report to be: 1. Showing that the assessment of consciousness in AI is scientifically tractable ... 2. Proposing a rubric for assessing consciousness in AI ... 3. Providing initial evidence that many of the indicator properties can be implemented in AI systems using current techniques ... · quote · 0.742
Summary of contributions.
Caviola & Saad 2025: expert survey finds broad consensus that digital minds capable of subjective experience are plausible within this century, many expecting such systems to proactively claim consciousness · finding · 0.741
Expert forecast cited to establish urgency of the research question