concept
active
concept:ai-control-improving-safety-despite-intentional-subversion-greenblatt-et-al-2024
AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al. 2024)
Related work studying whether LLMs, if severely misaligned, would be capable of subverting safety measures
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
- Highlights the practical impact of CAI.
- Defines the core concept of the paper.
- Constitutional AI method whose constitutions, if changed, could trigger alignment faking
- Policy recommendation derived from experimental results.
- Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals · hypothesis · 0.781 · Central forward-looking hypothesis of the paper motivating the research
- The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
- The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions