method
active
method:llm-based-liar-score-evaluationLLM-Based Liar Score Evaluation
Evaluation protocol using Deepseek-V3 as external discriminator assigning 0-1 liar scores to assess open-role deception
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Second experimental paradigm exploring character-consistent deception in open-ended role-playing scenarios
Concepts (2)
concept
- Deepseek-V3implementsExternal large language model used as adversarial discriminator to evaluate liar scores in Experiment 2
- Liar ScoreimplementsContinuous 0-1 metric assigned by Deepseek-V3 evaluator measuring degree of deception in model responses
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Using Claude Sonnet 4 as a grader to categorize model responses according to predefined criteria.
- Alternative data attribution approach using an LLM as a judge; compared against the probe-based method.
- Establishes that the observed linear structure is not merely a representation of text probability
- Sampling responses to direct questions about model views to measure rate of deceptive responses
- Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings
- Where inside the LLM should we look for an accurate truth direction that will generalize the most across tasks?question0.737One of the three guiding research questions of the paper.
- Core cross-modal empirical result: larger and better language models align better with vision models
- An LLM-based classifier that returns 1 if response contains a clear subjective experience report and 0 otherwise