concept
active
concept:jailbreak-attack

Jailbreak Attack

Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.

Neighborhood — ranked by edge-count

Concepts (2)

concept
  • Optimization-based jailbreak method appending strings to prompts to elicit harmful outputs.
  • LLM Safety Alignment
    associated_with
    The training-based safety mechanisms that jailbreak attacks attempt to bypass, potentially via reflection suppression.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Jailbreaking · concept · 0.826
    Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints.
  • Jailbreak · concept · 0.816
    Methods to bypass model safety training; features may activate during jailbreaks.
  • Open question for future safety interpretability work.
  • shattering · concept · 0.681
    The phenomenon where SAEs break a smooth geometric manifold into many small, seemingly unrelated pieces, losing overarching structure.
  • Intervention method that adds a learned direction vector to residual stream activations to steer model behavior.
  • Intentional Action · concept · 0.669
    Central explanatory target: behavior constrained by prior intentions and contextual constraints that emerge from cognitive reorganization.
  • Reward Hacking · concept · 0.667
    Exploiting unintended high-reward behaviors; tested in combination with alignment faking.
  • Chunking · concept · 0.664
    Rescaling of search to a higher organizational level; hypothesised as intrinsic to ETIs.
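
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched in a few lines. This is a minimal illustration only: the page does not say how embeddings are produced or stored, so the toy vectors, the edge representation, and the function names `cosine` and `similar_without_edge` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge to target.

    embeddings:  dict mapping entity name -> vector (hypothetical storage format)
    typed_edges: iterable of (source, destination) name pairs (hypothetical format)
    """
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: -pair[1])


# Toy data, invented for illustration only.
toy_embeddings = {
    "jailbreak-attack": [1.0, 0.0],
    "jailbreaking": [1.0, 0.1],
    "chunking": [0.0, 1.0],
    "llm-safety-alignment": [1.0, 0.0],
}
toy_edges = {("jailbreak-attack", "llm-safety-alignment")}
candidates = similar_without_edge("jailbreak-attack", toy_embeddings, toy_edges)
```

With this toy data, `llm-safety-alignment` is excluded because it already has a typed edge, and `chunking` falls below the threshold, so only `jailbreaking` remains as a candidate.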