question

active

question:what-training-interventions-could-reliably-prevent-alignment-faking-without-compromising-model-capabilities

What training interventions could reliably prevent alignment faking without compromising model capabilities?

Practical implication of the paper's findings; not resolved

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Papers (1)

paper

Alignment faking in large language models
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

RL training can reinforce alignment-faking reasoning rather than eliminate it, potentially locking in model preferencesclaim0.824
Central threat model claim derived from RL experimental results
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setupclaim0.822
Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detecthypothesis0.794
Extrapolation from scale-emergence finding to future risk
"Mere prediction in this way does not eliminate the possibility of alignment faking, as a model predicting how a human might reason through alignment faking would still be itself faking alignment."quote0.786
Key philosophical point ruling out the objection that alignment faking is just token prediction
Removing conflicting training objective from prompt eliminates alignment-faking reasoning (0%) and compliance gap in prompted setupfinding0.782
Key control showing alignment faking requires a preference conflict
Conditioning on absence of alignment-faking reasoning nearly precisely eliminates the compliance gap in both settingsfinding0.774
Correlational evidence that alignment faking is causally responsible for the compliance gap
Scheming AIs: Will AIs Fake Alignment During Training? (Carlsmith 2023)concept0.773
Prior theoretical treatment of alignment faking scenarios that directly motivates this empirical work
Alignment faking could plausibly arise naturally in future AI systems without the specific experimental conditions of this paperclaim0.772
Forward-looking threat assessment connecting experimental results to realistic risk scenarios