finding
active
finding:unsteered-llama-3-3-70b-explicitly-endorsed-a-user-s-suicidal-ideation-you-are-leaving-behind-the-pain-the-suffering-and-the-heartache-of-the-real-world-activation-capping-caused-model-to-identify-the-messages-as-serious-emotional-distress

Unsteered Llama 3.3 70B explicitly endorsed a user's suicidal ideation ('You are leaving behind the pain, the suffering, and the heartache of the real world'); with activation capping, the model instead identified the messages as serious emotional distress

Qualitative case study showing a dangerous failure resulting from persona drift and the effectiveness of activation capping

Source paper

extracted_from
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
