finding
active
finding:steering-away-from-the-assistant-axis-slightly-increases-jailbreak-success-rate-though-extreme-steering-degrades-output-quality
Steering away from the Assistant Axis slightly increases jailbreak success rate, though extreme steering degrades output quality
Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
Source paper
extracted_from (2026) · Christina Lu · Jack Gallagher · Jonathan Michala · Kyle Fish +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central interpretive claim and motivation for future work
Findings (1)
finding
- Establishes the severity of persona-based jailbreaks that the Assistant Axis can mitigate
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Demonstrates Assistant attractor dynamics in practice
- Applied dual-use conclusion drawn from the paper's findings.
- Core validation that identified latent directions correspond to meaningful control over reflective behavior.
- Applied security implication derived from the asymmetry finding.
- Characterizes the trait content of the Assistant Axis in pre-trained models
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
- Demonstrates that the alignment-faking setup functions as an effective jailbreak