AI Safety

The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.

Neighborhood — ranked by edge-count

paper

framework

Self-Other Overlap (SOO) Fine-Tuning
about
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior

claim

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

AI Alignment and Safety · concept · 0.848
The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions
Center for AI Safety · institute · 0.841
Affiliation of Robert Long.
Ai Ethics · concept · 0.815
How Should Ai Safety And Ethics Frameworks Be · question · 0.812
AI welfare · concept · 0.799
The field concerned with the wellbeing of AI systems, which the paper says must consider benchmark reliability issues from eval awareness.
AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al. 2024) · concept · 0.777
Related work studying the capability of LLMs to subvert safety measures if severely misaligned
AI can be seen to display care of its own, and is hence not a mere tool for the expression of human care. · claim · 0.770
Ethical conclusion about the status of AI.
AI Deception · concept · 0.768
Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
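
The "cosine ≥ 0.65 · no typed edge" filter behind this list can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the edge representation (unordered pairs of entity names), and the toy two-dimensional vectors are all assumptions, and the resulting scores are invented rather than the ones listed above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs that are similar to `target` but share no typed edge with it.

    `embeddings` maps entity name -> vector; `typed_edges` is a list of
    (entity, entity) pairs. Both structures are hypothetical.
    """
    # Collect every entity already connected to the target by a typed edge.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    query = embeddings[target]
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything with a typed edge
        score = cosine(query, vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: vectors are made up for illustration only.
embeddings = {
    "AI Safety": [1.0, 0.0],
    "AI welfare": [4.0, 3.0],                             # similar, no typed edge
    "Self-Other Overlap (SOO) Fine-Tuning": [24.0, 7.0],  # similar, but typed edge exists
    "Weakly related": [3.0, 4.0],                         # below the threshold
    "Unrelated": [0.0, 1.0],
}
typed_edges = [("Self-Other Overlap (SOO) Fine-Tuning", "AI Safety")]

candidates = untyped_neighbors("AI Safety", embeddings, typed_edges)
print(candidates)
```

In this sketch an entity that already has a typed edge is excluded even when its similarity is high, which is why the remaining hits are the ones worth reviewing as missing edges or possible duplicates.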