community
active
leiden_hybrid_concepts
label: haiku
community:leiden_hybrid_concepts-run4-c6-c7
AI-supervised alignment and scalable oversight
Methods for training safe AI systems using AI feedback instead of human labels, scaling supervision as capabilities grow.
6 members.
Drawn from 3 sources
The papers/notes whose extracted claims & findings make up this cluster.
- CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence (4 members)
- Toward an ethics of autopoietic technology: Stress, care, and intelligence (1 member)
- 2026-05-14_phil-trans-A-goodfire-aboutblank-impact.md (1 member)
Bridges (2)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Claims (6)
- AI feedback can effectively replace human feedback for harmlessness in RLHF-style training. The paper demonstrates that RLAIF with constitutional principles matches or exceeds HH RLHF.
- All intelligent agents—biological, technological, or hybrid—can be assessed via stress-care-intelligence loops regardless of substrate.
- Chain-of-thought reasoning improves the transparency and performance of AI decision making in harmlessness evaluation. CoT improves accuracy on HHH evals and makes the decision process legible.
- Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels. The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
- Scaling supervision through AI self-improvement is feasible and may be necessary as AI capabilities advance. The paper provides evidence that AI can help supervise AI, reducing reliance on humans.
- Introducing AI agents into human-learning populations reduces costly individual learning and depletes information supply.