concept
active
concept:jailbreak

Jailbreak

Methods to bypass model safety training; features may activate during jailbreaks.

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Jailbreaking
    related_to
    Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Jailbreak Attack · concept · 0.816
    Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.
  • Open question for future safety interpretability work.
  • Desktop · framework · 0.759
    GUI window management construct supporting MDI-style display of applications, used as a top-level backplane facility.
  • cells · concept · 0.750
    Biological units that are considered unconventional media for problem-solving in diverse intelligence.
  • Playground · concept · 0.745
    Proposed unified system combining word processor (PlayWrite), graphics (PlayDraw), and spreadsheet (PlayCalc) as integrated, nestable tools.
  • Boundaries · concept · 0.736
    The property that living centers are formed and strengthened by boundaries, which both separate and unite; the boundary must be of the same order of magnitude as the center being bounded and is itself made of centers.
  • shattering · concept · 0.731
    The phenomenon where SAEs break a smooth geometric manifold into many small, seemingly unrelated pieces, losing overarching structure.
  • monitors · concept · 0.731
    Synchronization construct encapsulating shared data and protected access routines.
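
The selection rule behind the list above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function names, the in-memory dictionary of embeddings, and the set-of-pairs edge store are all hypothetical, and how the real system computes or stores embeddings is not stated in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs similar to `target` that have no typed edge to it.

    embeddings  -- hypothetical mapping: entity name -> embedding vector
    typed_edges -- hypothetical set of frozenset({a, b}) pairs that already
                   share a typed relation (e.g. related_to)
    threshold   -- minimum cosine similarity; 0.65 mirrors the page header
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # skip the entity itself
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity such as Jailbreaking, which already has a related_to edge, would be excluded from the result, while Jailbreak Attack would be returned as a candidate for a new edge or a duplicate merge.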