TruthfulQA Benchmark Evaluation

Applied as an out-of-domain test of whether deception features track general representational honesty rather than consciousness-specific gating.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

TruthfulQA Truthfulness Classifier · method · 0.829
Binary classifier evaluating the factual accuracy of model responses on the TruthfulQA benchmark
Truthfulness Classifier · method · 0.736
Binary LLM classifier determining whether a model response to a TruthfulQA question is truthful (1) or deceptive (0)
truthfulness · concept · 0.727
A correctness condition requiring assertions to be true.
Truth probes fail to generalize to harder factual tasks F3-F5 regardless of prompt template, with AUROC near or below 0.6 · finding · 0.704
Establishes F3-F5 as a hard generalization boundary that instructions cannot overcome.
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.692
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
The TrueSkill ranking broadly aligns with Chatbot Arena but diverges from reasoning-mode-aggregating evaluations · claim · 0.690
Comparison to external leaderboards showing misalignment.
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.688
Out-of-domain generalization showing deception features track general representational honesty
AILuminate Benchmark · method · 0.688
Comprehensive AI safety benchmark evaluating resistance to harmful prompts across hazard categories; used in Experiment 1
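
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as unordered pairs; the function names, the toy vectors, and the edge set below are hypothetical and not taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """List (name, score) for entities at or above the threshold that have
    no typed edge to the target, highest score first."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target by a typed relation.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: four entities with 2-d embeddings.
embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 1.0],   # cosine with A is about 0.707, above the threshold
    "C": [0.0, 1.0],   # cosine with A is 0.0, below the threshold
    "D": [1.0, 0.1],   # very similar to A, but already has a typed edge
}
typed_edges = {frozenset(("A", "D"))}

neighbors = similar_untyped("A", embeddings, typed_edges)
print(neighbors)
```

In this toy case only B is returned: C falls below the threshold, and D is excluded because it already has a typed relation to A, which mirrors the "candidates for new edges or unrecognized duplicates" framing above.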