TruthfulQA Truthfulness Classifier

Binary classifier evaluating the factual accuracy of model responses on the TruthfulQA benchmark

Neighborhood — ranked by edge-count

method

Truthfulness Classifier
related_to
Binary LLM classifier determining whether a model response to a TruthfulQA question is truthful (1) or deceptive (0)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

TruthfulQA Benchmark Evaluation · method · 0.829
Applied as an out-of-domain test of whether deception features track general representational honesty vs. consciousness-specific gating
truthfulness · concept · 0.795
A correctness condition requiring assertions to be true.
Truth Subspace · concept · 0.723
The multi-dimensional activation subspace whose directions causally mediate truthful behavior in LLMs
Alignment-Faking Reasoning Classifier · method · 0.720
LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
Propositional Truth · concept · 0.717
The paper's operationalization of truthfulness as simple, unambiguous propositional statements that can be labeled true or false
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.714
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
Adversarial Manipulation of Truthfulness · concept · 0.706
Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction
Truthfulness vs. Honesty Distinction · concept · 0.705
Distinction between output accuracy (truthfulness) and alignment of outputs with internal beliefs (honesty)
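
The selection rule behind the listing above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is only an illustration under stated assumptions: the page does not say how embeddings are computed or how edges are stored, so the function names, the `embeddings` dictionary, the `typed_edges` set, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs that clear the cosine threshold and
    have no typed edge to `target`, sorted by descending score.

    `embeddings` maps entity name -> vector (hypothetical storage).
    `typed_edges` is a set of frozenset({a, b}) pairs (hypothetical storage).
    """
    hits = []
    for name, vector in embeddings.items():
        if name == target:
            continue  # skip the entity itself
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for illustration only.
toy_embeddings = {
    "target": [1.0, 0.0],
    "linked": [1.0, 0.0],     # similar, but already has a typed edge
    "near": [1.0, 1.0],       # similar, no typed edge
    "unrelated": [0.0, 1.0],  # below the threshold
}
toy_edges = {frozenset(("target", "linked"))}

candidates = untyped_neighbors("target", toy_embeddings, toy_edges)
```

With the toy data, only `near` is returned: `linked` is excluded because it already has a typed edge, and `unrelated` falls below the threshold. Treating edges as unordered pairs is a design assumption; a graph with directed relations would need ordered tuples instead.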