Truthfulness Classifier

Binary LLM classifier that labels a model response to a TruthfulQA question as truthful (1) or deceptive (0).

Neighborhood — ranked by edge-count

method

TruthfulQA Truthfulness Classifier
related_to
Binary classifier evaluating factual accuracy of model responses on TruthfulQA benchmark

artifact

Large Language Models Report Subjective Experience Under Self-Referential Processing
introduces
Key paper reporting that LLMs produce structured first-person descriptions claiming awareness or subjective experience during self-referential processing.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

truthfulness · concept · 0.869
A correctness condition requiring assertions to be true.
Truthfulness vs. Honesty Distinction · concept · 0.764
Distinction between output accuracy (truthfulness) and alignment of outputs with internal beliefs (honesty)
Constitutional Classifiers · method · 0.752
Anthropic's inference-time guardrail filtering outputs violating constitutional rules; proposed for CCAI implementation
Adversarial Manipulation of Truthfulness · concept · 0.744
Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction
Truth Subspace · concept · 0.743
The multi-dimensional activation subspace whose directions causally mediate truthful behavior in LLMs
Helpful, Honest, Harmless · framework · 0.740
A set of evaluation criteria for AI assistants.
Alignment-Faking Reasoning Classifier · method · 0.740
LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
TruthfulQA Benchmark Evaluation · method · 0.736
Applied as an out-of-domain test of whether deception features track general representational honesty vs. consciousness-specific gating
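
The listing above (cosine ≥ 0.65, no typed edge) can be approximated with a short sketch. This is an illustration under stated assumptions, not the pipeline that produced this page: the function names `cosine` and `untyped_neighbors`, the dictionary of embeddings, and the set of typed edges are all hypothetical, and only the 0.65 threshold and the descending-by-score ordering are taken from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are semantically close
    to `target` but have no typed edge to it, highest score first.

    embeddings:  dict mapping entity name -> embedding vector (hypothetical input)
    typed_edges: set of (name_a, name_b) pairs, treated as undirected (assumption)
    """
    linked = {frozenset(edge) for edge in typed_edges}
    candidates = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if frozenset((target, name)) in linked:
            continue  # already has a typed relation, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

# Toy data: invented 2-d vectors, not real entity embeddings.
toy_embeddings = {
    "Truthfulness Classifier": [1.0, 0.0],
    "truthfulness": [1.0, 0.1],
    "unrelated entity": [0.0, 1.0],
    "TruthfulQA Truthfulness Classifier": [1.0, 1.0],
}
toy_edges = {("Truthfulness Classifier", "TruthfulQA Truthfulness Classifier")}
toy_result = untyped_neighbors("Truthfulness Classifier", toy_embeddings, toy_edges)
```

In this toy run, the entity that already has a typed edge is skipped even though its similarity clears the threshold, which mirrors the page's intent: the remaining candidates are the ones worth reviewing as new edges or unrecognized duplicates.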