concept
active
concept:ai-safety
AI Safety
The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
Neighborhood — ranked by edge-count
Papers (2)
paper
- Taking AI Welfare Seriously · mentions
Frameworks (1)
framework
- The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
Claims (2)
claim
- Primary positive claim of the paper, grounded in strength comparison and localization results
- Safety strategies predicated on model self-reports may provide false assurance while genuine risks go undetected · associated_with · Policy-relevant implication drawn from the binary detection confound result
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions
- Affiliation of Robert Long.
- The field concerned with the wellbeing of AI systems, which the paper says must consider benchmark reliability issues from eval awareness.
- Related work studying capability of LLMs to subvert safety measures if severely misaligned
- Ethical conclusion about the status of AI.
- Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
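
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustration under stated assumptions: the entity names, the two-dimensional embeddings, and the helper functions `cosine` and `similar_without_edge` are hypothetical and are not taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that have
    no typed edge to the target, highest score first."""
    results = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: names and vectors are assumptions for illustration.
embeddings = {
    "ai-safety": [1.0, 0.0],
    "ai-welfare": [0.8, 0.6],   # similar enough to pass the threshold
    "deception": [0.6, 0.8],    # falls below the 0.65 threshold
    "paper-x": [1.0, 0.0],      # very similar, but already has a typed edge
}
typed_edges = {"paper-x"}

candidates = similar_without_edge("ai-safety", embeddings, typed_edges)
```

In this toy setup only `ai-welfare` survives both filters, which mirrors how the list above surfaces candidates for new edges or unrecognized duplicates rather than entities that are already linked.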