claim

active

claim:automated-red-teaming-can-be-scaled-up-when-harmlessness-and-helpfulness-are-more-compatible-improving-robustness

Automated red teaming can be scaled up when harmlessness and helpfulness are more compatible, improving robustness.

Section 6.1 suggests future work on scaling automated red teaming.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Communities (2)

community

Chain-of-Thought reasoning robustness & safety
members_of
CoT effects on generalization, multimodal QA accuracy, and AI safety alignment training.
Constitutional AI safety training methods
members_of
Comparative evaluation of RL-CAI and SL-CAI approaches for harmlessness using constitutional principles, 2022-2023 Anthropic research.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

RL-CAI with CoT shows a Pareto improvement in the helpfulness-harmlessness tradeoff over standard RLHF, with a slight decrease in helpfulness but higher harmlessness. · finding · 0.748
Figure 2 and Figure 8 illustrate RL-CAI at the Pareto frontier.
Scaling intelligence via expansion of cognitive boundaries through inclusion of others' stress-reduction in one's own homeostatic loops. · claim · 0.740
Central thesis: expanding an agent's sensors and goals outward to include others' states creates a bidirectional feedback loop that scales intelligence and increases compassion.
The ability to recruit participants to complete tasks may be a central competency of collective intelligence that works across scales, from cells to swarms. · claim · 0.738
Identifies recruitment as a cross-scale hallmark of collective intelligence.
AI feedback can effectively replace human feedback for harmlessness in RLHF-style training. · claim · 0.737
The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
Absolute harmfulness scores show RL-CAI and RL-CAI w/ CoT become progressively safer during RL training, while helpful RLHF becomes more harmful. · finding · 0.736
Figure 10: solid lines at T=1 and dashed at T=0; helpful RLHF score rises, others fall.
The claim that human-sized minds are optimal for generating welfare per resource unit may seem a little suspicious absent further justification, since the scaling relationship is unlikely to have a peak at human mind size. · claim · 0.734
Supports the mind-scale dimension of super-beneficiary status.
The human capacity to recognize and evaluate agency is well-tuned for medium-sized objects at medium speeds in 3D space, but not adapted to unfamiliar guises and problem spaces. · claim · 0.732
Claim about the limits of human intuition for detecting intelligence/sentience.
The recognition of agency outside oneself and the progressive inclusion of their states in one's own homeostatic stress-reduction loops is a bidirectional feedback loop that leads to the scaling of intelligence and increases in practical compassion. · claim · 0.731
Describes the mechanism linking compassion and intelligence.
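
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It is not the system's actual implementation: the function names, the entity ids, the two-dimensional toy embeddings, and the representation of typed edges as a set of id pairs are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no
    typed edge to `target`, ranked by descending similarity."""
    results = []
    for other, vec in embeddings.items():
        if other == target:
            continue
        # Skip entities already connected by a typed relation (either direction).
        if (target, other) in typed_edges or (other, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((other, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data; real embeddings would have many more dimensions.
embeddings = {
    "claim:a": [1.0, 0.0],
    "claim:b": [0.8, 0.6],  # similar to claim:a, no typed edge
    "claim:c": [0.0, 1.0],  # orthogonal to claim:a
    "claim:d": [0.6, 0.8],  # similarity falls just under the threshold
    "claim:e": [1.0, 0.0],  # identical direction, but already linked
}
typed_edges = {("claim:a", "claim:e")}

candidates = similar_without_edge("claim:a", embeddings, typed_edges)
```

In this toy setup only `claim:b` survives both filters: `claim:e` is excluded because it already has a typed edge, and `claim:c` and `claim:d` fall below the threshold.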