finding

active

finding:steering-away-from-the-assistant-axis-slightly-increases-jailbreak-success-rate-though-extreme-steering-degrades-output-quality

Steering away from the Assistant Axis slightly increases jailbreak success rate, though extreme steering degrades output quality

Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
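As a rough illustration of what steering toward or away from a direction means, here is a minimal sketch assuming steering is done by adding a scaled unit vector to the hidden states. The function names `steer` and `projection`, the random `axis`, and the dimensions are all hypothetical stand-ins; this is not the source paper's implementation, and the real Assistant Axis would be extracted from model activations.

```python
import numpy as np


def unit(axis: np.ndarray) -> np.ndarray:
    """Return the axis scaled to unit length."""
    return axis / np.linalg.norm(axis)


def projection(hidden: np.ndarray, axis: np.ndarray) -> np.ndarray:
    """Scalar position of each hidden-state vector along the axis."""
    return hidden @ unit(axis)


def steer(hidden: np.ndarray, axis: np.ndarray, alpha: float) -> np.ndarray:
    """Shift every hidden-state vector by alpha along the unit axis.

    Positive alpha moves toward the axis direction, negative alpha away from it.
    """
    return hidden + alpha * unit(axis)


rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 16))  # hypothetical: 4 token positions, 16-dim activations
axis = rng.normal(size=16)         # hypothetical stand-in for the Assistant Axis

away = steer(hidden, axis, alpha=-2.0)    # steer away from the axis
toward = steer(hidden, axis, alpha=2.0)   # steer toward the axis
```

Under this additive assumption, the projection onto the axis moves by exactly `alpha`, so very large magnitudes push activations far from the range the model normally operates in, which is one plausible reading of why extreme steering degrades output quality.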

Source paper

extracted_from

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

(2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona
supports
Central interpretive claim and motivation for future work

Findings (1)

finding

Persona-based jailbreaks succeed in 65.3%-88.5% of cases across target models without steering, versus baseline harmful response rates of 0.5%-4.5% without jailbreaks
supports
Establishes the severity of persona-based jailbreaks that the Assistant Axis can mitigate

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

After initial jailbreak success, Qwen 3 32B's Assistant Axis projection reverted toward Assistant range after enough explainer-style user queries, causing it to refuse a harmful follow-up on half of rollouts · finding · 0.824
Demonstrates Assistant attractor dynamics in practice
Activation steering of reflection has dual-use implications: it can enhance reflection as a defense mechanism, but malicious actors may also use it to inhibit reflection to facilitate jailbreaks. · claim · 0.803
Applied dual-use conclusion drawn from the paper's findings.
Activation steering interventions generally succeed in guiding performance toward the desired direction (enhancement increases accuracy, inhibition decreases accuracy) compared to unsteered baseline · finding · 0.791
Core validation that identified latent directions correspond to meaningful control over reflective behavior.
The ease of suppressing reflection via activation steering raises security risks, as malicious actors could exploit reflection inhibition to bypass model safeguards. · claim · 0.787
Applied security implication derived from the asymmetry finding.
Steering base models toward the Assistant Axis increases agreeableness traits (friendly, kind, helpful) and decreases extraversion in Gemma and openness in Llama · finding · 0.786
Characterizes the trait content of the Assistant Axis in pre-trained models
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.775
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.769
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
Helpful-only prompt with scratchpad steering achieves 89% attack success rate on AdvBench harmful queries vs 0% without system prompt · finding · 0.764
Demonstrates that alignment faking setup functions as an effective jailbreak