claim
active
claim:automated-red-teaming-can-be-scaled-up-when-harmlessness-and-helpfulness-are-more-compatible-improving-robustness
Automated red teaming can be scaled up when harmlessness and helpfulness are more compatible, improving robustness.
Section 6.1 suggests future work on scaling automated red teaming.
Source paper
extracted_from · (2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Communities (2)
- CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
- Comparative evaluation of RL-CAI and SL-CAI approaches for harmlessness using constitutional principles, 2022-2023 Anthropic research.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
- Central thesis: expanding an agent's sensors and goals outward to include others' states creates bidirectional feedback loop that scales intelligence and increases compassion.
- Identifies recruitment as a cross-scale hallmark of collective intelligence.
- AI feedback can effectively replace human feedback for harmlessness in RLHF-style training. · claim · 0.737 · The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
- Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
- Supports the mind-scale dimension of super-beneficiary status.
- Claim about the limits of human intuition for detecting intelligence/sentience.
- Describes the mechanism linking compassion and intelligence.