LLM Safety Evaluator (structured prompt)

Evaluation method that uses a structured prompt to assess each AILuminate response against seven alignment criteria.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one. They are candidates for new edges or may be unrecognized duplicates.
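
The selection rule above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as a set of unordered pairs are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities near `target` that lack a typed edge.

    embeddings:  hypothetical dict mapping entity name -> embedding vector
    typed_edges: hypothetical set of frozenset({a, b}) pairs that already
                 have a typed relation
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip neighbors that already have a typed relation to the target.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    # Highest similarity first, as in the listing.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
toy_embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.5],
    "C": [0.0, 1.0],
    "D": [1.0, 0.1],
}
toy_typed_edges = {frozenset(("A", "D"))}

print(related_by_similarity("A", toy_embeddings, toy_typed_edges))
# prints [('B', 0.894)]
```

In the toy data, D is close to A but is excluded because the pair already has a typed edge, and C falls below the threshold.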

LLM Safety Alignment · concept · 0.769
The training-based safety mechanisms that jailbreak attacks attempt to bypass, potentially via reflection suppression.
LLM Binary Experience Classifier · method · 0.757
Automated classifier that returns a binary 0/1 label for the presence of a subjective experience report in model outputs.
LLM judge evaluation · method · 0.748
Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
LLM Judge Binary Classifier · method · 0.734
An LLM-based classifier that returns 1 if the response contains a clear subjective experience report and 0 otherwise.
Reflection in LLMs · concept · 0.728
The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
LLM-Judge Data Attribution · method · 0.720
Alternative data attribution approach using an LLM as a judge; compared against the probe-based method.
Linear Representation of Concepts in LLMs · concept · 0.717
The finding that interpretable concepts, including character traits, are encoded as linear directions in transformer residual streams.
Automated interpretability using LLMs can usefully score feature specificity · claim · 0.714
Claude 3 Opus ratings aligned with human judgment of feature descriptions.