LLM Safety Alignment

The training-based safety mechanisms that jailbreak attacks attempt to bypass, potentially via reflection suppression.

Neighborhood — ranked by edge-count

concept

Jailbreak Attack
associated_with
Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLM Safety Evaluator (structured prompt) · method · 0.769
Evaluation method that uses a structured prompt to assess each AILuminate response against seven alignment criteria.
Alignment · concept · 0.756
The goal of making model behavior match human values and intentions, often addressed during post-training.
AI Alignment and Safety · concept · 0.751
The broader domain for which ESR has dual implications: resistance to adversarial manipulation versus interference with safety interventions.
Reflection in LLMs · concept · 0.749
The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
Linear Alignment Map (ϕ_lin) · method · 0.744
Alignment map ϕ(h) = W_orth h, where W_orth is an orthogonal matrix; assumes the linear representation hypothesis.
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: linear relationship between language modeling score and vision-language alignment · finding · 0.744
Core cross-modal empirical result: larger and better language models align better with vision models.
Alignment Between High-Level and Low-Level Models · concept · 0.744
A mapping assigning to each high-level variable a set of low-level variables and a function from low-level to high-level values.
LLM-Judge Data Attribution · method · 0.742
Alternative data attribution approach using an LLM as a judge; compared against the probe-based method.
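
The candidate list above can be read as the output of a simple filter: keep entities whose embedding has cosine similarity of at least 0.65 with this entity and which share no typed edge with it. The sketch below illustrates that idea only. The 0.65 threshold comes from this page; the function names, the edge representation, and the toy two-dimensional embeddings are hypothetical and are not taken from the system that produced the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge to target.

    embeddings:  dict mapping entity name -> vector (hypothetical storage format)
    typed_edges: iterable of (source, target) name pairs, treated as undirected
    """
    # Collect every entity already linked to the target by a typed edge.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    scored = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))

    # Highest similarity first, matching the ordering of the list above.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: the vectors are invented for illustration, not real embeddings.
toy_embeddings = {
    "LLM Safety Alignment": [1.0, 0.0],
    "Jailbreak Attack": [0.9, 0.1],    # similar, but already has a typed edge
    "Alignment": [0.8, 0.6],           # cosine 0.8, no typed edge
    "Reflection in LLMs": [0.6, 0.8],  # cosine 0.6, below the threshold
}
toy_edges = [("Jailbreak Attack", "LLM Safety Alignment")]

candidates = untyped_neighbors("LLM Safety Alignment", toy_embeddings, toy_edges)
```

In this toy setup only "Alignment" survives the filter: "Jailbreak Attack" is excluded because it already has a typed edge, and "Reflection in LLMs" falls below the threshold. Entries returned this way are candidates for review, either as new typed edges or as possible duplicates.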