concept

active

concept:evans-et-al-2021-truthful-ai-developing-and-governing-ai-that-does-not-lie

Evans et al. 2021 - Truthful AI: Developing and Governing AI that Does Not Lie

Reference establishing the distinction between truthfulness and honesty and the need for honest AI

Neighborhood — ranked by edge-count

Papers (1)

Towards Safe and Honest AI Agents with Neural Self-Other Overlap · paper · cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Bai et al. 2022: Constitutional AI — harmlessness from AI feedback · concept · 0.757
Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
Zou et al. 2023 - Representation Engineering: A Top-Down Approach to AI Transparency · concept · 0.754
Framework paper describing the broader class of methods within which SOO fine-tuning fits
AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al. 2024) · concept · 0.751
Related work studying capability of LLMs to subvert safety measures if severely misaligned
"We would like to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance." · quote · 0.750
Foundational motivation for the research.
Representation engineering: A top-down approach to AI transparency (Zou et al., 2023) · concept · 0.749
Key prior work on representation engineering that ReflCtrl directly extends
"AI systems can be strategists, using deception because they have reasoned out that this can promote a goal" · quote · 0.740
Load-bearing definition of strategic deception in AI systems from Park et al. 2023, adopted and refined in this paper
Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022b) · concept · 0.740
Constitutional AI method whose constitutions, if changed, could trigger alignment faking
Future, more capable AI systems are at risk of alignment faking, whether for benign or malicious goals · hypothesis · 0.738
Central forward-looking hypothesis of the paper motivating the research