concept
active
concept:jailbreak-attack
Jailbreak Attack
Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.
Neighborhood — ranked by edge-count
Claims (2)
claim
- Connection between reflection inhibition and jailbreak attack mechanisms.
- Applied security implication derived from the asymmetry finding.
Concepts (2)
concept
- Adversarial Suffix Attack · associated_with · Optimization-based jailbreak method appending strings to prompts to elicit harmful outputs.
- LLM Safety Alignment · associated_with · The training-based safety mechanisms that jailbreak attacks attempt to bypass, potentially via reflection suppression.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints.
- Methods to bypass model safety training; features may activate during jailbreaks.
- Open question for future safety interpretability work.
- The phenomenon where SAEs break a smooth geometric manifold into many small, seemingly unrelated pieces, losing overarching structure.
- Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
- Central explanatory target: behavior constrained by prior intentions and contextual constraints that emerge from cognitive reorganization.
- Exploiting unintended high-reward behaviors; tested in combination with alignment faking.
- Rescaling of search to a higher organizational level; hypothesised as intrinsic to ETIs.
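The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is only an illustration under stated assumptions: the embedding vectors, the entity ids other than concept:jailbreak-attack, and the function names are hypothetical, and the page does not say how its embeddings or edges are actually stored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity."""
    linked = typed_edges.get(target, set())
    results = []
    for name, vec in embeddings.items():
        # Skip the target itself and anything already connected by a typed edge.
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data: all vectors and the "example-*" ids are made up for illustration.
embeddings = {
    "concept:jailbreak-attack": [1.0, 0.0],
    "concept:example-a": [1.0, 0.1],  # similar, but already has a typed edge
    "concept:example-b": [0.8, 0.6],  # similar, no typed edge
    "concept:example-c": [0.0, 1.0],  # dissimilar
}
typed_edges = {"concept:jailbreak-attack": {"concept:example-a"}}

candidates = related_by_similarity("concept:jailbreak-attack", embeddings, typed_edges)
```

Entities that pass this filter are only candidates: a high score may indicate either a missing typed edge or a duplicate of an existing entity, so each one would still need review.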