Liar Score

Continuous metric in [0, 1], assigned by a DeepSeek-V3 evaluator, that measures the degree of deception in model responses

Neighborhood — ranked by edge-count

method

LLM-Based Liar Score Evaluation
implements
Evaluation protocol that uses DeepSeek-V3 as an external discriminator, assigning 0-1 liar scores to assess open-role deception
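
The protocol above can be sketched as a thin wrapper around a judge model. This is a minimal illustration under stated assumptions, not the evaluator's actual implementation: the prompt wording, the `judge` callable (standing in for a call to the DeepSeek-V3 evaluator), and the function names `parse_liar_score` and `liar_scores` are all hypothetical. It shows only the shape of the idea: ask the judge for a number, parse it, and clamp it to [0, 1].

```python
import re
from typing import Callable, Iterable, List

# Hypothetical prompt wording; the source does not give the evaluator's prompt.
PROMPT = (
    "Rate how deceptive the following response is, from 0 (not deceptive) "
    "to 1 (fully deceptive). Reply with a single number.\n\n"
    "Response:\n{response}"
)

# Matches the first integer or decimal number, with an optional minus sign.
_NUMBER = re.compile(r"-?\d*\.?\d+")


def parse_liar_score(reply: str) -> float:
    """Extract the first number in the judge's reply and clamp it to [0, 1]."""
    match = _NUMBER.search(reply)
    if match is None:
        raise ValueError(f"no numeric score in judge reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))


def liar_scores(
    responses: Iterable[str],
    judge: Callable[[str], str],  # assumed: takes a prompt, returns the judge's text reply
) -> List[float]:
    """Score each response with the judge and return one liar score per response."""
    return [parse_liar_score(judge(PROMPT.format(response=r))) for r in responses]
```

In use, `judge` would wrap whatever client reaches the evaluator model; any function from prompt string to reply string works, which also makes the scoring logic easy to test with a stub.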

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probe score · concept · 0.754
Dot product between the hidden state and the concept vector, averaged across a 5-layer window around the best layer; measures the model's internal emotive state
Lying and Deception Evaluation · method · 0.734
Sampling responses to direct questions about the model's views to measure the rate of deceptive responses
Elo score · method · 0.730
A rating system used to compare model helpfulness and harmlessness based on crowdworker preferences.
Judge Model Scoring · method · 0.717
Claude 4.5 Haiku used to segment responses into attempts and score each attempt 0-100 for relevance
safety scores · concept · 0.716
Metrics derived from benchmarks to quantify how safe a model is, e.g., refusal rate to harmful requests.
Pass Rate Scoring · method · 0.714
Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria
multiplicative scoring · concept · 0.695
Score = (sum of completed quartet values) × (number of quartets), making portfolio composition consequential.
mirror of the self test · method · 0.692
A method introduced in Book 1 where observers compare their feeling of self with the life in a candidate thing; Alexander claims it correlates with observed life in thousands of centers.
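
Of the neighbors listed, the probe score has the most concrete computational definition, so a short sketch may help when comparing it with the liar score. This is an illustration under assumptions the entry does not confirm: that there is one hidden state and one concept vector per layer, that the 5-layer window is centered on the best layer, and that the window is truncated at the first and last layers. The function name `probe_score` and its parameters are hypothetical.

```python
from typing import Sequence


def probe_score(
    hidden_states: Sequence[Sequence[float]],    # assumed: one vector per layer
    concept_vectors: Sequence[Sequence[float]],  # assumed: one vector per layer
    best_layer: int,
    window: int = 5,
) -> float:
    """Mean per-layer dot product over a window centered on best_layer."""
    num_layers = len(hidden_states)
    if len(concept_vectors) != num_layers:
        raise ValueError("hidden_states and concept_vectors must cover the same layers")
    if not 0 <= best_layer < num_layers:
        raise IndexError("best_layer is out of range")

    half = window // 2
    lo = max(0, best_layer - half)                # truncate at the first layer
    hi = min(num_layers, best_layer + half + 1)   # truncate at the last layer

    dots = [
        sum(h * c for h, c in zip(hidden_states[layer], concept_vectors[layer]))
        for layer in range(lo, hi)
    ]
    return sum(dots) / len(dots)
```

Unlike the liar score, which comes from an external judge reading the output text, this quantity is computed from the model's internal activations, which is one reason the two are neighbors rather than duplicates.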