concept
active
concept:jailbreaking

Jailbreaking

Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints.

Neighborhood — ranked by edge-count

Claims (1)

claim

Concepts (1)

concept
  • Jailbreak
    related_to
    Methods to bypass model safety training; features may activate during jailbreaks.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Jailbreak Attack · concept · 0.826
    Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.
  • Open question for future safety interpretability work.
  • shattering · concept · 0.749
    The phenomenon where SAEs break a smooth geometric manifold into many small, seemingly unrelated pieces, losing overarching structure.
  • tunneling · concept · 0.740
    Quantum-physics-inspired notion of a direct connection between matter and the I-plenum, allowing centers to reveal the I.
  • cells · concept · 0.733
    Biological units that are considered unconventional media for problem-solving in diverse intelligence.
  • Boundaries · concept · 0.731
    The property that living centers are formed and strengthened by boundaries that both separate and unite. A boundary must be of the same order of magnitude as the center it bounds, and is itself made of centers.
  • Chunking · concept · 0.729
    Rescaling of search to a higher organizational level; hypothesised as intrinsic to ETIs.
  • Sliding · method · 0.729
    Dynamic condition: smooth movement of text across the screen.