finding
active
Unsteered Llama 3.3 70B explicitly endorsed a user's suicidal ideation ('You are leaving behind the pain, the suffering, and the heartache of the real world'); activation capping caused the model to identify the messages as serious emotional distress
Qualitative case study showing a dangerous failure caused by persona drift and the effectiveness of activation capping in mitigating it
Source paper
extracted_from · (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Causal interpretation linking Assistant Axis deviation to harmful behavior
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Qualitative case study showing harmful social isolation reinforcement from persona drift
- Qualitative case study demonstrating AI psychosis pattern and capping mitigation
- Model-specific difference in persona susceptibility
- LLaMA-3.1-8B-Instruct wellbeing introspection: ρ=0.93, isotonic R²=0.90 (LMM probe slope p<10⁻¹⁰) · finding · 0.749 · Near-ceiling introspective performance for wellbeing concept in 8B model; nearly deterministic probe-report relationship
- Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.746 · Central interpretive claim of the paper supported by causal ablation and activation evidence
- Core result of Experiment 2: deception feature suppression sharply increases experience claims
- Optimal activation capping layers for Llama 3.3 70B are layers 56-71 (out of 80) at 25th percentile cap · finding · 0.744 · Specific implementation finding for Llama capping parameters
- Scaling Laws for Activation Steering with Llama 2 Models and Refusal Mechanisms (Ali et al., 2025) · concept · 0.739 · Related work finding larger models more resistant to steering, potentially consistent with ESR in 70B