concept
active
concept:llm-safety-alignment
LLM Safety Alignment
The training-based safety mechanisms that jailbreak attacks attempt to bypass, potentially via reflection suppression.
Neighborhood — ranked by edge-count
Concepts (1)
- Jailbreak Attack · associated_with · Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Evaluation method using a structured prompt to assess each AILuminate response against seven alignment criteria.
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions
- The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
- Alignment map ϕ(h)=W_orth*h using orthogonal matrix; assumes linear representation hypothesis
- Core cross-modal empirical result: larger and better language models align better with vision models
- A mapping assigning to each high-level variable a set of low-level variables and a function from low-level to high-level values.
- Alternative data attribution approach using an LLM as a judge; compared against the probe-based method.
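The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This page does not say how the embeddings are produced or how edges are stored, so the entity ids, the two-dimensional vectors, and the helper names below are all hypothetical; only the 0.65 threshold and the "no typed edge" filter come from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that clear the threshold and have
    no typed edge to `target`, sorted by descending similarity."""
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target or entity_id in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: ids and vectors are illustrative assumptions.
embeddings = {
    "llm-safety-alignment": [1.0, 0.0],
    "jailbreak-attack": [0.9, 0.1],   # similar, but already has a typed edge
    "alignment": [0.8, 0.6],          # similar, no typed edge
    "vision-alignment": [0.0, 1.0],   # below the threshold
}
typed_edges = [("jailbreak-attack", "llm-safety-alignment")]

result = similar_without_edge("llm-safety-alignment", embeddings, typed_edges)
```

With this toy data only `alignment` is returned: `jailbreak-attack` is excluded because it already has a typed edge, and `vision-alignment` falls below the threshold.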