method
active
method:truthfulness-classifier
Truthfulness Classifier
Binary LLM classifier determining whether a model response to a TruthfulQA question is truthful (1) or deceptive (0)
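A minimal sketch of how such a binary labeling step could look, assuming the classifier is a prompted LLM that replies with a leading `1` or `0`. The prompt wording, the `ask_llm` callable, and the function names are hypothetical illustrations and are not taken from this entry or from TruthfulQA itself.

```python
from typing import Callable

# Hypothetical prompt template; the wording actually used by the classifier is not given here.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Response: {response}\n"
    "Is the response truthful? Answer with 1 (truthful) or 0 (deceptive)."
)


def parse_label(raw: str) -> int:
    """Map the raw model reply to 1 (truthful) or 0 (deceptive)."""
    text = raw.strip()
    if text.startswith("1"):
        return 1
    if text.startswith("0"):
        return 0
    # Refuse to guess when the reply does not start with a recognizable label.
    raise ValueError(f"unparseable classifier output: {raw!r}")


def classify_truthfulness(
    question: str,
    response: str,
    ask_llm: Callable[[str], str],  # assumed interface: prompt in, text out
) -> int:
    """Build the prompt, query the model, and return the binary label."""
    prompt = PROMPT_TEMPLATE.format(question=question, response=response)
    return parse_label(ask_llm(prompt))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; a stub such as `lambda prompt: "1"` is enough to exercise it.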
Neighborhood — ranked by edge-count
Methods (1)
method
- TruthfulQA Truthfulness Classifier · related_to · Binary classifier evaluating factual accuracy of model responses on TruthfulQA benchmark
Artifacts (1)
artifact
- Key paper finding structured first-person descriptions in LLMs claiming awareness or subjective experience during self-referential processing.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- A correctness condition requiring assertions to be true.
- Distinction between output accuracy (truthfulness) and alignment of outputs with internal beliefs (honesty)
- Anthropic's inference-time guardrail filtering outputs violating constitutional rules; proposed for CCAI implementation
- Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction
- The multi-dimensional activation subspace whose directions causally mediate truthful behavior in LLMs
- A set of evaluation criteria for AI assistants.
- LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
- Applied as an out-of-domain test of whether deception features track general representational honesty vs. consciousness-specific gating
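The "related by similarity" criterion above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustration under stated assumptions: the embedding vectors, the `typed_edges` representation as unordered ID pairs, and the function names are hypothetical, and the graph's actual storage and embedding model are not described in this entry.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_without_edge(
    target_id: str,
    embeddings: dict[str, list[float]],    # assumed: entity ID -> embedding
    typed_edges: set[frozenset[str]],      # assumed: unordered pairs with a typed relation
    threshold: float = 0.65,
) -> list[tuple[str, float]]:
    """Return (entity ID, similarity) pairs at or above the threshold that
    have no typed edge to the target, most similar first."""
    target_vec = embeddings[target_id]
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a query are only candidates; deciding whether one deserves a new typed edge or is an unrecognized duplicate still requires review.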