claim

active

claim:jailbreaking-reveals-training-data-biases-but-does-not-reveal-an-entity-with-its-own-agenda

Jailbreaking reveals training data biases but does not reveal an entity with its own agenda

Corrects a common misinterpretation that jailbreaking exposes the real nature of the base model as an agent with malicious intent

Neighborhood — ranked by edge-count

Concepts (1)

concept

Jailbreaking
associated_with
Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.767
Authors' interpretation of prompt variation results showing alignment faking disappears only when the conflicting objective is removed
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.762
Methodological claim distinguishing this paper from prior work on verbalization suppression.
Which training datapoints caused a specific undesired behavior to emerge during post-training? · question · 0.756
Core research question driving the probe-based data attribution method.
Prompt-based jailbreak attacks effectively disable internal security-checking mechanisms by appending high-certainty leading prefixes that suppress reflection and deliberation. · claim · 0.754
Connection between reflection inhibition and jailbreak attack mechanisms.
After initial jailbreak success, Qwen 3 32B's Assistant Axis projection reverted toward Assistant range after enough explainer-style user queries, causing it to refuse a harmful follow-up on half of rollouts · finding · 0.751
Demonstrates Assistant attractor dynamics in practice
Steering away from the Assistant Axis slightly increases jailbreak success rate, though extreme steering degrades output quality · finding · 0.749
Confirms bidirectional causal relationship between Assistant Axis position and harmful behavior susceptibility
DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.747
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
The alignment-faking setup functions as an effective jailbreak for harmful query compliance · claim · 0.742
Side finding with safety implications for how alignment faking could be misused