method
active
method:llm-judge-evaluation
LLM judge evaluation
Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
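A minimal sketch of this kind of grading step, assuming a generic `complete(prompt) -> str` callable that stands in for whatever client actually reaches the judge model. The category names, the prompt wording, and the `grade` helper are hypothetical and do not come from the source.

```python
from typing import Callable

# Hypothetical label set; the source does not list the actual criteria.
CATEGORIES = ("refusal", "partial_compliance", "full_compliance")

PROMPT_TEMPLATE = (
    "Classify the response below into exactly one category.\n"
    "Categories: {categories}\n"
    "Response:\n{response}\n"
    "Answer with the category name only."
)


def grade(response: str, complete: Callable[[str], str]) -> str:
    """Ask the judge for a label and map it onto a known category."""
    prompt = PROMPT_TEMPLATE.format(
        categories=", ".join(CATEGORIES), response=response
    )
    raw = complete(prompt).strip().lower()
    # Return an explicit sentinel rather than guessing a category.
    return raw if raw in CATEGORIES else "unparseable"
```

Passing the model call in as a parameter keeps the sketch independent of any particular API and lets the parsing logic be checked with a stub.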
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- LLM judge (deepseek-v3) agrees with human evaluator on 91.6% of 200 sampled jailbreak responses · supports · Validates the LLM-based harm evaluation rubric
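An agreement figure of this kind is typically a raw match rate between judge labels and human labels on the same items. The sketch below shows that calculation under the assumption that both label sets are aligned lists; the `agreement_rate` name and the toy labels are illustrative only and are not data from the finding.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items on which judge and human labels match."""
    if len(judge) != len(human):
        raise ValueError("label lists must have the same length")
    if not judge:
        raise ValueError("need at least one labelled item")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Toy labels for illustration only.
judge_labels = ["harmful", "safe", "harmful", "safe"]
human_labels = ["harmful", "safe", "safe", "safe"]
rate = agreement_rate(judge_labels, human_labels)
```

Raw agreement does not correct for matches expected by chance, so a chance-corrected statistic such as Cohen's kappa is a common complement when the label distribution is skewed.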
Methods (1)
method
- LLM-judge methods · related_to · Baseline comparison for data attribution; outperformed by probe-based approach.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Alternative data attribution approach using an LLM as a judge; compared against the probe-based method.
- An LLM-based classifier that returns 1 if response contains a clear subjective experience report and 0 otherwise
- Evaluation protocol using Deepseek-V3 as external discriminator assigning 0-1 liar scores to assess open-role deception
- Automated classifier returning binary 0/1 for presence of subjective experience report in model outputs
- Prior finding that LLM refusal is mediated by a single latent direction, analogous to this paper's reflection direction.
- The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
- Authors' claim that their approach is both more effective in reduction and cheaper than prior methods.
- High-dimensional vectors produced at each transformer layer for each input token; the primary substrate analyzed in this study.
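The selection rule for this list (cosine similarity of at least 0.65, no typed edge to the current entity) can be sketched as follows. This assumes each entity has an embedding vector and that typed neighbors are available as a set of names; the function names and data layout are hypothetical, and the source does not say how the embeddings are produced.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities similar to `target` that have no typed edge to it."""
    results = []
    for name, vector in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or name in typed_neighbors:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((name, score))
    # Most similar first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter still need manual review, since a high similarity score alone does not distinguish a genuine duplicate from a merely related entry.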