Jailbreak

Methods for bypassing a model's safety training; particular model features may activate during jailbreaks.

Neighborhood — ranked by edge-count

concept

Jailbreaking
related_to
Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Jailbreak Attack · concept · 0.816
Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.
what features activate during jailbreaks? · question · 0.781
Open question for future safety interpretability work.
Desktop · framework · 0.759
GUI window management construct supporting MDI-style display of applications, used as a top-level backplane facility.
cells · concept · 0.750
Biological units that are considered unconventional media for problem-solving in diverse intelligence.
Playground · concept · 0.745
Proposed unified system combining word processor (PlayWrite), graphics (PlayDraw), and spreadsheet (PlayCalc) as integrated, nestable tools.
Boundaries · concept · 0.736
The property that living centers are formed and strengthened by boundaries which both separate and unite; the boundary must be of the same order of magnitude as the center being bounded and is itself made of centers.
shattering · concept · 0.731
The phenomenon where SAEs break a smooth geometric manifold into many small, seemingly unrelated pieces, losing overarching structure.
monitors · concept · 0.731
Synchronization construct encapsulating shared data and protected access routines.
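
The selection rule for the list above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration only, not the system's actual implementation: the function names, the toy two-dimensional embeddings, and the representation of typed edges as name pairs are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs whose cosine similarity to `target` is at
    least `threshold` and that share no typed edge with it, highest first."""
    # Entities already linked to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: real embeddings would be high-dimensional.
embeddings = {
    "Jailbreak": [1.0, 0.0],
    "Jailbreaking": [1.0, 0.1],      # already linked via related_to
    "Jailbreak Attack": [0.9, 0.4],  # similar, but no typed edge
    "Unrelated": [0.0, 1.0],         # below the threshold
}
typed_edges = [("Jailbreak", "Jailbreaking")]

candidates = untyped_neighbors("Jailbreak", embeddings, typed_edges)
for name, score in candidates:
    print(f"{name} · {score:.3f}")
```

A filter like this works on embedding similarity alone, which is why entries with no real connection to the entity can appear next to a likely duplicate; each candidate still needs review before an edge is added or entities are merged.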