method
active
method:llm-judge-data-attribution
LLM-Judge Data Attribution
Alternative data attribution approach using an LLM as a judge; compared against the probe-based method.
Neighborhood — ranked by edge-count
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
- Baseline comparison for data attribution; outperformed by probe-based approach.
- An LLM-based classifier that returns 1 if the response contains a clear subjective experience report and 0 otherwise.
- Evaluation protocol using DeepSeek-V3 as an external discriminator that assigns liar scores between 0 and 1 to assess open-role deception.
- High-dimensional vectors produced at each transformer layer for each input token; the primary substrate analyzed in this study.
- The ability of LLMs to monitor and evaluate their own reasoning, closely related to reflection.
- Related capability where LLMs correct their own outputs, studied via linear representations.
- The task of attributing model behaviors to specific training datapoints.
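
The "cosine ≥ 0.65 · no typed edge" filter behind this list can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as unordered pairs of entity ids; the function names, the data layout, and the example ids are hypothetical and are not taken from the system that generated this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, similarity) pairs at or above the threshold that
    have no typed edge to `target`, sorted by descending similarity.

    embeddings:  dict mapping entity id -> list of floats (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    """
    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target:
            continue
        # Skip entities that already have a typed relation to the target.
        if frozenset((target, entity_id)) in typed_edges:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            results.append((entity_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data: ids and vectors are illustrative only.
embeddings = {
    "method:llm-judge-data-attribution": [1.0, 0.0],
    "method:llm-grader": [1.0, 0.1],
    "concept:hidden-states": [0.0, 1.0],
    "method:probe-based-attribution": [1.0, 0.2],
}
typed_edges = {
    frozenset(("method:llm-judge-data-attribution", "method:probe-based-attribution")),
}
candidates = untyped_neighbors(
    "method:llm-judge-data-attribution", embeddings, typed_edges
)
```

In this toy data only `method:llm-grader` is returned: `method:probe-based-attribution` is similar but already has a typed edge, and `concept:hidden-states` falls below the threshold.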