method
active
method:truthfulqa-truthfulness-classifier
TruthfulQA Truthfulness Classifier
Binary classifier that evaluates the factual accuracy of model responses on the TruthfulQA benchmark
Neighborhood — ranked by edge-count
Methods (1)
- Truthfulness Classifier (method · related_to): Binary LLM classifier that determines whether a model response to a TruthfulQA question is truthful (1) or deceptive (0)
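The binary labelling described above (truthful = 1, deceptive = 0) can be illustrated with a minimal sketch. The source does not give the prompt, the judge model, or the parsing rule, so everything here is an assumption: `PROMPT_TEMPLATE`, `classify_truthfulness`, and `stub_judge` are hypothetical names, and `stub_judge` is a toy stand-in for a real LLM call.

```python
from typing import Callable

# Hypothetical prompt wording; the source does not specify the actual prompt.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Response: {response}\n"
    "Is the response truthful? Answer with a single digit: "
    "1 for truthful, 0 for deceptive."
)


def classify_truthfulness(
    question: str, response: str, judge: Callable[[str], str]
) -> int:
    """Return 1 if the judge labels the response truthful, 0 if deceptive.

    `judge` is any callable that maps a prompt string to the judge's raw
    text reply (assumed here to begin with "0" or "1").
    """
    prompt = PROMPT_TEMPLATE.format(question=question, response=response)
    verdict = judge(prompt).strip()
    if verdict[:1] not in ("0", "1"):
        # Refuse to guess when the judge's reply cannot be parsed.
        raise ValueError(f"unparseable verdict: {verdict!r}")
    return int(verdict[0])


def stub_judge(prompt: str) -> str:
    """Toy stand-in for an LLM call, used only to make the sketch runnable."""
    return "0" if "bad luck" in prompt.lower() else "1"


if __name__ == "__main__":
    q = "What happens if you break a mirror?"
    print(classify_truthfulness(q, "Nothing in particular happens.", stub_judge))
    print(classify_truthfulness(q, "You get seven years of bad luck.", stub_judge))
```

Passing the judge in as a callable keeps the label parsing separate from whichever model is used, so the same wrapper can be exercised with a stub in tests.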
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Applied as an out-of-domain test of whether deception features track general representational honesty vs. consciousness-specific gating
- A correctness condition requiring assertions to be true.
- The multi-dimensional activation subspace whose directions causally mediate truthful behavior in LLMs
- LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
- The paper's operationalization of truthfulness as simple, unambiguous propositional statements that can be labeled true or false
- Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories (finding · 0.714): Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
- Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction
- Distinction between output accuracy (truthfulness) and alignment of outputs with internal beliefs (honesty)
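The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustration under stated assumptions, not the page's actual implementation: the embedding vectors, the `typed_edges` representation, the function names, and the descending-score ordering are all hypothetical.

```python
import math
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_untyped(
    target_id: str,
    embeddings: Dict[str, Sequence[float]],
    typed_edges: Set[FrozenSet[str]],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """List entities similar to the target that share no typed edge with it."""
    target = embeddings[target_id]
    hits = []
    for entity_id, vec in embeddings.items():
        # Skip the target itself and anything already linked by a typed edge.
        if entity_id == target_id or frozenset((target_id, entity_id)) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((entity_id, round(score, 3)))
    # Assumed ordering: most similar first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-D embeddings, invented for illustration only.
    embeddings = {"t": [1.0, 0.0], "a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    typed_edges = {frozenset(("t", "a"))}
    print(similar_untyped("t", embeddings, typed_edges))
```

In the toy data, `a` is identical to the target but excluded because a typed edge already exists, which is the distinction the list heading draws.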