LLM judge evaluation

Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
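
A minimal sketch of what such a grading step could look like, assuming the judge is prompted to pick exactly one label from a fixed list. The prompt wording, the function names (`build_judge_prompt`, `parse_label`, `grade`), and the `call_judge` callable are all hypothetical and not taken from the source; `call_judge` stands in for whatever client actually sends the prompt to the grader model.

```python
from typing import Callable, Optional, Sequence


def build_judge_prompt(response: str, categories: Sequence[str]) -> str:
    """Assemble a single-label classification prompt (wording is an assumption)."""
    options = ", ".join(categories)
    return (
        "Classify the response below into exactly one category.\n"
        f"Categories: {options}\n"
        f"Response:\n{response}\n"
        "Answer with the category name only."
    )


def parse_label(raw: str, categories: Sequence[str]) -> Optional[str]:
    """Map the judge's raw text onto a known category, or None if it does not match."""
    cleaned = raw.strip().lower()
    for category in categories:
        if cleaned == category.lower():
            return category
    return None


def grade(
    response: str,
    categories: Sequence[str],
    call_judge: Callable[[str], str],
) -> Optional[str]:
    """Grade one response; call_judge is a placeholder for the real model call."""
    prompt = build_judge_prompt(response, categories)
    return parse_label(call_judge(prompt), categories)
```

Returning `None` for unparseable judge output, rather than guessing a label, keeps malformed grades visible so they can be retried or counted separately.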

Neighborhood — ranked by edge-count

method

LLM-judge methods
related_to
Baseline comparison for data attribution; outperformed by probe-based approach.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLM-Judge Data Attribution · method · 0.846
Alternative data attribution approach using an LLM as a judge; compared against the probe-based method.
LLM Judge Binary Classifier · method · 0.826
An LLM-based classifier that returns 1 if the response contains a clear subjective experience report and 0 otherwise.
LLM-Based Liar Score Evaluation · method · 0.797
Evaluation protocol using DeepSeek-V3 as an external discriminator that assigns liar scores between 0 and 1 to assess open-role deception.
LLM Binary Experience Classifier · method · 0.769
Automated classifier returning a binary 0/1 label for the presence of a subjective experience report in model outputs.
Refusal Direction in LLMs · concept · 0.757
Prior finding that LLM refusal is mediated by a single latent direction, analogous to this paper's reflection direction.
Reflection in LLMs · concept · 0.756
The core phenomenon studied: the ability of LLMs to evaluate and revise their own reasoning.
Probe-based ranking outperforms gradient-based and LLM-judge methods at lower cost · claim · 0.749
Authors' claim that their approach is both more effective in reduction and cheaper than prior methods.
LLM Internal Representations · concept · 0.748
High-dimensional vectors produced at each transformer layer for each input token; the primary substrate analyzed in this study.
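
The "cosine ≥ 0.65 · no typed edge" filter behind this list can be sketched as follows. This is an illustration under stated assumptions, not the page's actual implementation: it assumes each entity has an embedding vector, that typed relations are available as a set of neighbor names, and the function names (`cosine`, `untyped_neighbors`) are hypothetical. Only the 0.65 threshold comes from the page.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(
    target: str,
    embeddings: Dict[str, Sequence[float]],  # assumed: entity name -> vector
    typed_edges: Set[str],                   # assumed: names already linked to target
    threshold: float = 0.65,                 # threshold shown on the page
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs above the threshold with no typed edge, best first."""
    results = []
    for name, vector in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and already-related entities
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entries returned by such a filter are only candidates: a high score may indicate a missing edge or a duplicate entity, which still needs manual review.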