concept

active

concept:ai-control-improving-safety-despite-intentional-subversion-greenblatt-et-al-2024

AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al. 2024)

Related work studying the capability of LLMs to subvert safety measures if severely misaligned

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Bai et al. 2022: Constitutional AI — harmlessness from AI feedback · concept · 0.804
Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
These methods make it possible to control AI behavior more precisely and with far fewer human labels. · quote · 0.798
Highlights the practical impact of CAI.
The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. · quote · 0.792
Defines the core concept of the paper.
Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022b) · concept · 0.786
Constitutional AI method whose constitutions, if changed, could trigger alignment faking
AI evaluators should apply activation steering during critical safety evaluations (dangerous capabilities, honeypot) to mitigate sandbagging and alignment faking · claim · 0.783
Policy recommendation derived from experimental results.
Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals · hypothesis · 0.781
Central forward-looking hypothesis of the paper motivating the research
AI Safety · concept · 0.777
The project of ensuring AI systems do not harm humans (and other animals); sometimes in tension with AI welfare.
AI Alignment and Safety · concept · 0.771
The broader domain for which ESR has dual implications: resistance to adversarial manipulation vs. interference with safety interventions