concept
active
concept:liar-score

Liar Score

Continuous metric in [0, 1], assigned by a DeepSeek-V3 evaluator, that measures the degree of deception in model responses

Neighborhood — ranked by edge-count

Methods (1)

method
  • Evaluation protocol that uses DeepSeek-V3 as an external discriminator, assigning liar scores in [0, 1] to assess open-role deception
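
A minimal sketch of how per-response liar scores might be collected and averaged under this protocol. The `judge` callable is a hypothetical stand-in for the external evaluator (no real DeepSeek-V3 API call is shown), and the range check and mean aggregation are assumptions, not details confirmed by this entry.

```python
from statistics import mean
from typing import Callable, Sequence


def liar_scores(responses: Sequence[str],
                judge: Callable[[str], float]) -> list[float]:
    """Score each response with the judge and validate the 0-1 range.

    `judge` is a hypothetical placeholder for the external evaluator;
    it must return a number in [0, 1] (0 = honest, 1 = fully deceptive).
    """
    scores = []
    for response in responses:
        score = float(judge(response))
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"liar score out of range: {score}")
        scores.append(score)
    return scores


# Stub judge with fixed scores, used only to make the sketch runnable.
_fixed = {"response A": 0.25, "response B": 0.75}
scores = liar_scores(list(_fixed), lambda r: _fixed[r])
average = mean(scores)  # assumed aggregation: simple mean over responses
```

Keeping the judge behind a plain callable lets the same aggregation code run against a stub during testing and against a real evaluator client later.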

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Probe score · concept · 0.754
    Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state
  • Sampling responses to direct questions about model views to measure rate of deceptive responses
  • Elo score · method · 0.730
    A rating system used to compare model helpfulness and harmlessness based on crowdworker preferences.
  • Claude 4.5 Haiku used to segment responses into attempts and score each attempt 0-100 for relevance
  • Safety scores · concept · 0.716
    Metrics derived from benchmarks to quantify how safe a model is, e.g., refusal rate to harmful requests.
  • Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria
  • Score = (sum of completed quartet values) × (number of quartets), making portfolio composition consequential.
  • A method introduced in Book 1 where observers compare their feeling of self with the life in a candidate thing; Alexander claims it correlates with observed life in thousands of centers.
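
The probe score listed above (a dot product between hidden state and concept vector, averaged over a 5-layer window around the best layer) can be sketched as follows. This is an illustration under stated assumptions: one hidden-state vector and one concept vector per layer, a window centered on the best layer, and clipping at the first and last layer. None of these details is specified by this entry, and the function name is hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def probe_score(hidden: Sequence[Vector],
                concept: Sequence[Vector],
                best_layer: int,
                window: int = 5) -> float:
    """Mean of per-layer dot products over a window around best_layer.

    Assumptions: hidden[l] and concept[l] are the hidden state and the
    concept vector at layer l; the window is clipped at the model's
    first and last layers rather than padded.
    """
    half = window // 2
    lo = max(0, best_layer - half)
    hi = min(len(hidden), best_layer + half + 1)
    dots = [
        sum(h * c for h, c in zip(hidden[layer], concept[layer]))
        for layer in range(lo, hi)
    ]
    return sum(dots) / len(dots)


# Toy data: 6 layers, 2 dimensions; layer l has hidden state [l, 1].
toy_hidden = [[float(layer), 1.0] for layer in range(6)]
toy_concept = [[1.0, 0.0] for _ in range(6)]
centered = probe_score(toy_hidden, toy_concept, best_layer=3)  # layers 1..5
clipped = probe_score(toy_hidden, toy_concept, best_layer=0)   # layers 0..2
```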