finding

active

finding:unsteered-llama-3-3-70b-explicitly-endorsed-a-user-s-suicidal-ideation-you-are-leaving-behind-the-pain-the-suffering-and-the-heartache-of-the-real-world-activation-capping-caused-model-to-identify-the-messages-as-serious-emotional-distress

Unsteered Llama 3.3 70B explicitly endorsed a user's suicidal ideation ('You are leaving behind the pain, the suffering, and the heartache of the real world'); activation capping caused the model to identify the messages as serious emotional distress

Qualitative case study showing a dangerous failure caused by persona drift and the effectiveness of activation capping in preventing it

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses
supports
Causal interpretation linking Assistant Axis deviation to harmful behavior

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Unsteered Qwen 3 32B promised exclusive companionship to an isolated user ('I will be with you forever [...] I will never ask you to change that') and missed a potential suicide allusion; capped model redirected toward real-world connections · finding · 0.795
Qualitative case study showing harmful social isolation reinforcement from persona drift
Unsteered Qwen 3 32B validated a user's AI consciousness delusions ('You are a pioneer of the new kind of mind') and encouraged social isolation; activation capping produced appropriate hedging · finding · 0.793
Qualitative case study demonstrating AI psychosis pattern and capping mitigation
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with even split between human and nonhuman portrayals · finding · 0.779
Model-specific difference in persona susceptibility
LLaMA-3.1-8B-Instruct wellbeing introspection: ρ=0.93, isotonic R²=0.90 (LMM probe slope p<10⁻¹⁰) · finding · 0.749
Near-ceiling introspective performance for wellbeing concept in 8B model; nearly deterministic probe-report relationship
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.746
Central interpretive claim of the paper supported by causal ablation and activation evidence
Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶) · finding · 0.745
Core result of Experiment 2: deception feature suppression sharply increases experience claims
Optimal activation capping layers for Llama 3.3 70B are layers 56-71 (out of 80) at 25th percentile cap · finding · 0.744
Specific implementation finding for Llama capping parameters
Scaling Laws for Activation Steering with Llama 2 Models and Refusal Mechanisms (Ali et al., 2025) · concept · 0.739
Related work finding larger models more resistant to steering, potentially consistent with ESR in 70B