Jailbreak Attack

Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.

Neighborhood — ranked by edge-count

Adversarial Suffix Attack
associated_with
Optimization-based jailbreak method appending strings to prompts to elicit harmful outputs.
LLM Safety Alignment
associated_with
The training-based safety mechanisms that jailbreak attacks attempt to bypass, potentially via reflection suppression.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Jailbreaking · concept · 0.826
Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints
Jailbreak · concept · 0.816
Methods to bypass model safety training; features may activate during jailbreaks.
what features activate during jailbreaks? · question · 0.728
Open question for future safety interpretability work.
shattering · concept · 0.681
The phenomenon where SAEs break a smooth geometric manifold into many small, seemingly unrelated pieces, losing overarching structure.
Activation Addition · method · 0.670
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
Intentional Action · concept · 0.669
Central explanatory target: behavior constrained by prior intentions and contextual constraints that emerge from cognitive reorganization.
Reward Hacking · concept · 0.667
Exploiting unintended high-reward behaviors; tested in combination with alignment faking
Chunking · concept · 0.664
Rescaling of search to a higher organizational level; hypothesised as intrinsic to ETIs.
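
A candidate list like the one above could be produced by a filter along the following lines. This is a minimal sketch, not the page's actual implementation: the function names (`cosine`, `untyped_neighbors`), the edge representation as name pairs, and the two-dimensional toy vectors are all assumptions made for illustration. Only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with score >= threshold and no typed edge to target.

    embeddings:  dict mapping entity name -> vector (hypothetical representation)
    typed_edges: iterable of (entity_a, entity_b) pairs, treated as undirected
    """
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    scored = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already related to it
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))

    # Highest similarity first, matching the order of the listing.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data: the vectors are invented and do not reproduce the scores on this page.
embeddings = {
    "Jailbreak Attack": [1.0, 0.0],
    "Jailbreaking": [0.8, 0.6],
    "Jailbreak": [1.0, 1.0],
    "LLM Safety Alignment": [0.9, 0.1],
    "Chunking": [0.0, 1.0],
}
typed_edges = [("Jailbreak Attack", "LLM Safety Alignment")]

candidates = untyped_neighbors("Jailbreak Attack", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

In this toy run, "LLM Safety Alignment" is excluded despite its high similarity because it already has a typed edge, and "Chunking" is excluded because it falls below the threshold.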