concept
active
concept:evans-et-al-2021-truthful-ai-developing-and-governing-ai-that-does-not-lie
Evans et al. 2021 - Truthful AI: Developing and Governing AI that Does Not Lie
Reference establishing the truthfulness/honesty distinction and the need for honest AI
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
- Framework paper describing the broader class of methods within which SOO fine-tuning fits
- Related work studying capability of LLMs to subvert safety measures if severely misaligned
- Foundational motivation for the research.
- Key prior work on representation engineering that ReflCtrl directly extends
- Load-bearing definition of strategic deception in AI systems from Park et al. 2023, adopted and refined in this paper
- Constitutional AI method whose constitutions, if changed, could trigger alignment faking
- Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals · hypothesis · 0.738 · Central forward-looking hypothesis of the paper motivating the research