quote
active
quote:we-would-like-to-train-ai-systems-that-remain-helpful-honest-and-harmless-even-as-some-ai-capabilities-reach-or-exceed-human-level-performance
We would like to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance.
Foundational motivation for the research.
Source paper
extracted_from (2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Scalability concern.
- Reframes AI not as a threat but as a preparatory exercise for the harder ethical challenges to come.
- Paraphrase of Cantwell Smith's argument; aligns with Buddhist emphasis on seeing reality without conceptual imposition.
- The ethical implication of the identity thesis applied to gradient-based AI training
- Core proposal that machine intelligence can achieve what human effort cannot.
- Load-bearing quote encapsulating the paper's reframing of current AI as preparatory exercise
- Key takeaway from abstract, amended version.
- The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.