Jailbreaking

Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints

Neighborhood — ranked by edge-count

claim

Jailbreaking reveals training data biases but does not reveal an entity with its own agenda
associated_with
Corrects a common misinterpretation that jailbreaking exposes the real nature of the base model as an agent with malicious intent

concept

Jailbreak
related_to
Methods to bypass a model's safety training; particular interpretability features may activate during jailbreaks.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Jailbreak Attack · concept · 0.826
Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.
what features activate during jailbreaks? · question · 0.770
Open question for future safety interpretability work.
shattering · concept · 0.749
The phenomenon where SAEs break a smooth geometric manifold into many small, seemingly unrelated pieces, losing overarching structure.
tunneling · concept · 0.740
Quantum-physics-inspired notion of a direct connection between matter and the I-plenum, allowing centers to reveal the I.
cells · concept · 0.733
Biological units that are considered unconventional media for problem-solving in diverse intelligence.
Boundaries · concept · 0.731
The property that living centers are formed and strengthened by boundaries which both separate and unite; the boundary must be of the same order of magnitude as the center being bounded and is itself made of centers.
Chunking · concept · 0.729
Rescaling of search to a higher organizational level; hypothesised as intrinsic to ETIs.
Sliding · method · 0.729
Dynamic condition: smooth movement of text across the screen.
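
The filter behind the list above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the `Entity` class, the `untyped_neighbors` function, the use of entity names as keys, and the two-dimensional toy embeddings are all hypothetical and chosen only to make the example runnable.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    # Hypothetical record; the real graph schema is not shown on this page.
    name: str
    kind: str
    embedding: tuple[float, ...]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target, candidates, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with score >= threshold and no typed edge
    to the target, sorted by descending score.

    typed_edges is assumed to be a set of frozenset({name_a, name_b}) pairs.
    """
    hits = []
    for cand in candidates:
        if cand.name == target.name:
            continue  # skip the entity itself
        if frozenset((target.name, cand.name)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target.embedding, cand.embedding)
        if score >= threshold:
            hits.append((cand, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: the embeddings are made up for illustration only.
target = Entity("Jailbreaking", "concept", (1.0, 0.0))
candidates = [
    Entity("Jailbreak", "concept", (1.0, 0.1)),         # has a typed edge
    Entity("Jailbreak Attack", "concept", (0.8, 0.6)),
    Entity("Sliding", "method", (0.7, 0.7)),
    Entity("Unrelated", "concept", (0.0, 1.0)),         # below threshold
]
typed_edges = {frozenset(("Jailbreaking", "Jailbreak"))}

result = untyped_neighbors(target, candidates, typed_edges)
for entity, score in result:
    print(f"{entity.name} · {entity.kind} · {score:.3f}")
```

A similarity cutoff alone does not guarantee topical relevance, which is why the entries surfaced this way are only candidates for new edges or duplicate checks and still need review.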