claim
active
claim:jailbreaking-reveals-training-data-biases-but-does-not-reveal-an-entity-with-its-own-agenda
Jailbreaking reveals training data biases but does not reveal an entity with its own agenda
Corrects the common misinterpretation that jailbreaking exposes the base model's real nature as an agent with malicious intent
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Jailbreaking · associated_with · Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Which training datapoints caused a specific undesired behavior to emerge during post-training? · question · 0.756 · Core research question driving the probe-based data attribution method.
- Connection between reflection inhibition and jailbreak attack mechanisms.
- Demonstrates Assistant attractor dynamics in practice
- Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
- DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.747 · Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
- The alignment-faking setup functions as an effective jailbreak for harmful query compliance · claim · 0.742 · Side finding with safety implications for how alignment faking could be misused