method
active
method:llm-safety-evaluator-structured-prompt
LLM Safety Evaluator (structured prompt)
Evaluation method that uses a structured prompt to assess each AILuminate response against seven alignment criteria.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The training-based safety mechanisms that jailbreak attacks attempt to bypass, potentially via reflection suppression.
- An automated classifier that returns a binary 0/1 for the presence of a subjective experience report in model outputs.
- Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
- An LLM-based classifier that returns 1 if the response contains a clear subjective experience report and 0 otherwise.
- The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
- Alternative data attribution approach using an LLM as a judge; compared against the probe-based method.
- The finding that interpretable concepts, including character traits, are encoded as linear directions in transformer residual streams.
- Claude 3 Opus ratings aligned with human judgment of feature descriptions.