concept
active
concept:jailbreaking
Jailbreaking
Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints
Neighborhood — ranked by edge-count
Claims (1)
claim
- Jailbreaking reveals training data biases but does not reveal an entity with its own agenda · associated_with · Corrects a common misinterpretation that jailbreaking exposes the real nature of the base model as an agent with malicious intent
Concepts (1)
concept
- Jailbreak · related_to · Methods to bypass model safety training; features may activate during jailbreaks.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.
- Open question for future safety interpretability work.
- The phenomenon where SAEs break a smooth geometric manifold into many small, seemingly unrelated pieces, losing overarching structure.
- Quantum-physics-inspired notion of a direct connection between matter and the I-plenum, allowing centers to reveal the I.
- Biological units that are considered unconventional media for problem-solving in diverse intelligence.
- The property that living centers are formed and strengthened by boundaries which both separate and unite; the boundary must be of the same order of magnitude as the center being bounded and is itself made of centers
- Rescaling of search to a higher organizational level; hypothesised as intrinsic to ETIs.
- Dynamic condition: smooth movement of text across the screen.
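The "related by similarity" filter described above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustration under stated assumptions: the entity names, the two-dimensional embedding vectors, the `typed_edges` set, and the function names are all hypothetical, and the page does not say how its embeddings or edges are actually stored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have no
    typed edge to the target, sorted by descending similarity."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation to the target.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: names and vectors are made up for illustration.
embeddings = {
    "jailbreaking": [1.0, 0.0],
    "jailbreak": [0.9, 0.1],        # already linked by a typed edge
    "safety_bypass": [0.8, 0.6],    # similar, no typed edge
    "text_scrolling": [0.0, 1.0],   # dissimilar
}
typed_edges = {frozenset(("jailbreaking", "jailbreak"))}

candidates = similar_without_edge("jailbreaking", embeddings, typed_edges)
print(candidates)
```

In this toy data only `safety_bypass` survives the filter: `jailbreak` is excluded because it already has a typed edge, and `text_scrolling` falls below the threshold. A low threshold such as 0.65 can still admit loosely related entries, so results like these are best treated as candidates for review rather than confirmed relations.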