method
active
method:truthfulqa-benchmark-evaluation
TruthfulQA Benchmark Evaluation
Applied as an out-of-domain test of whether deception features track general representational honesty rather than consciousness-specific gating.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Binary classifier evaluating factual accuracy of model responses on the TruthfulQA benchmark
- Binary LLM classifier determining whether a model response to a TruthfulQA question is truthful (1) or deceptive (0)
- A correctness condition requiring assertions to be true.
- Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
- Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.692 · Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
- Comparison to external leaderboards showing misalignment.
- Out-of-domain generalization showing deception features track general representational honesty
- Comprehensive AI safety benchmark evaluating resistance to harmful prompts across hazard categories; used in Experiment 1
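
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the entity names, the two-dimensional toy embeddings, and the `related_by_similarity` helper are hypothetical and do not reflect how this graph actually stores or scores entities.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have
    no typed edge to the target, highest score first."""
    hits = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical toy data, not taken from the graph.
embeddings = {
    "truthfulqa-eval": [1.0, 0.0],
    "binary-classifier": [0.8, 0.6],  # similar, no typed edge
    "safety-benchmark": [0.6, 0.8],   # below the 0.65 threshold
    "linked-finding": [1.0, 0.1],     # similar, but already has a typed edge
}
typed_edges = {frozenset(("truthfulqa-eval", "linked-finding"))}

result = related_by_similarity("truthfulqa-eval", embeddings, typed_edges)
print(result)
```

In this toy setup only `binary-classifier` is returned: `safety-benchmark` falls below the threshold, and `linked-finding` is excluded because it already has a typed relation to the target.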