LLM-Judge Data Attribution

Alternative data attribution approach using an LLM as a judge; compared against the probe-based method.

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLM judge evaluation · method · 0.846
Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
LLM-judge methods · method · 0.836
Baseline comparison for data attribution; outperformed by probe-based approach.
LLM Judge Binary Classifier · method · 0.809
An LLM-based classifier that returns 1 if the response contains a clear subjective experience report and 0 otherwise.
LLM-Based Liar Score Evaluation · method · 0.778
Evaluation protocol using DeepSeek-V3 as an external discriminator that assigns liar scores between 0 and 1 to assess open-role deception.
LLM Internal Representations · concept · 0.772
High-dimensional vectors produced at each transformer layer for each input token; the primary substrate analyzed in this study.
LLM Meta-Cognition · concept · 0.767
The ability of LLMs to monitor and evaluate their own reasoning, closely related to reflection.
LLM Self-Correction · concept · 0.766
Related capability where LLMs correct their own outputs, studied via linear representations.
Data Attribution · concept · 0.765
The task of attributing model behaviors to specific training datapoints.
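
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names `cosine` and `untyped_neighbors`, the dictionary of embeddings, and the representation of typed edges as a set of unordered pairs are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, similarity) pairs for entities near `target` with no typed edge.

    embeddings:  dict mapping entity name -> embedding vector (hypothetical layout)
    typed_edges: set of frozenset({a, b}) pairs that already have a typed relation
    threshold:   minimum cosine similarity, 0.65 as stated on this page
    """
    results = []
    for name, vector in embeddings.items():
        if name == target:
            continue
        if frozenset({target, name}) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        similarity = cosine(embeddings[target], vector)
        if similarity >= threshold:
            results.append((name, similarity))
    # Highest similarity first, matching the descending scores shown above.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Each returned pair is a candidate for a new typed edge or a possible duplicate of the target entity; deciding which still requires review.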