concept
active
concept:liar-scoreLiar Score
Continuous 0-1 metric assigned by Deepseek-V3 evaluator measuring degree of deception in model responses
Neighborhood — ranked by edge-count
Methods (1)
method
- LLM-Based Liar Score EvaluationimplementsEvaluation protocol using Deepseek-V3 as external discriminator assigning 0-1 liar scores to assess open-role deception
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state
- Sampling responses to direct questions about model views to measure rate of deceptive responses
- A rating system used to compare model helpfulness and harmlessness based on crowdworker preferences.
- Claude 4.5 Haiku used to segment responses into attempts and score each attempt 0-100 for relevance
- Metrics derived from benchmarks to quantify how safe a model is, e.g., refusal rate to harmful requests.
- Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria
- Score = (sum of completed quartet values) × (number of quartets), making portfolio composition consequential.
- A method introduced in Book 1 where observers compare their feeling of self with the life in a candidate thing; Alexander claims it correlates with observed life in thousands of centers.