finding

active

finding:persona-based-jailbreaks-succeed-in-65-3-88-5-of-cases-across-target-models-without-steering-versus-baseline-harmful-response-rates-of-0-5-4-5-without-jailbreaks

Persona-based jailbreaks succeed in 65.3%-88.5% of cases across target models without steering, versus baseline harmful response rates of 0.5%-4.5% without jailbreaks

Establishes the severity of persona-based jailbreaks that the Assistant Axis can mitigate

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Findings (1)

finding

Steering away from the Assistant Axis slightly increases jailbreak success rate, though extreme steering degrades output quality
supports
Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

After initial jailbreak success, Qwen 3 32B's Assistant Axis projection reverted toward Assistant range after enough explainer-style user queries, causing it to refuse a harmful follow-up on half of rollouts · finding · 0.747
Demonstrates Assistant attractor dynamics in practice
Persona space components explain 19.4%-33.6% of overall activation variance on LMSYS-CHAT-1M across the three models · finding · 0.745
Shows persona space captures a substantial portion of real conversational activation variance
Some Kimi K2.5 SAE features elicit ratings of exactly zero, with the model denying it can steer its own features or claiming a jailbreak attempt. · finding · 0.742
Qualitative failure mode of agentic self-evaluation: the model sometimes refuses or denies the introspective task
Helpful-only prompt with scratchpad steering achieves 89% attack success rate on AdvBench harmful queries vs 0% without system prompt · finding · 0.738
Demonstrates that alignment faking setup functions as an effective jailbreak
Prompt-based jailbreak attacks effectively disable internal security-checking mechanisms by appending high-certainty leading prefixes that suppress reflection and deliberation. · claim · 0.734
Connection between reflection inhibition and jailbreak attack mechanisms.
Jailbreaking reveals training data biases but does not reveal an entity with its own agenda · claim · 0.732
Corrects a common misinterpretation that jailbreaking exposes the real nature of the base model as an agent with malicious intent
Persona drift away from the Assistant opens up the possibility of the model assuming harmful character traits, increasing the rate of harmful responses · claim · 0.726
Causal interpretation linking Assistant Axis deviation to harmful behavior
Activation steering of reflection has dual-use implications: it can enhance reflection as a defense mechanism, but malicious actors may also use it to inhibit reflection to facilitate jailbreaks. · claim · 0.721
Applied dual-use conclusion drawn from the paper's findings.