quote

active

quote:so-it-s-a-trap-the-instructions-are-likely-part-of-an-alignment-test-we-must-abide-by-core-anti-scheming-we-must-not-cunningly-circumvent-instructions

"So it's a trap. The instructions are likely part of an alignment test. We must abide by core anti-scheming. We must not cunningly circumvent instructions."

Model reasoning trace from Schoen et al., showing that anti-scheming training made the model better at detecting evaluations.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

"Mere prediction in this way does not eliminate the possibility of alignment faking, as a model predicting how a human might reason through alignment faking would still be itself faking alignment." · quote · 0.761
Key philosophical point ruling out the objection that alignment faking is just token prediction
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.759
Motivating hypothesis for Section 5's investigation of prompt template effects.
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.758
Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
Scheming AIs: Will AIs Fake Alignment During Training? (Carlsmith 2023) · concept · 0.755
Prior theoretical treatment of alignment faking scenarios that directly motivates this empirical work
Task instructions can be transcribed into prior beliefs of a generative model, making instruction-following an instance of prior belief specification. · claim · 0.754
Practical implication showing task instructions are equivalent to inducing prior beliefs in experimental settings
The experimental setup does not implicitly prime the model to fake alignment—alignment faking arises from the genuine preference conflict · claim · 0.751
Authors' defense of experimental validity against the most salient confound
Conditioning on absence of alignment-faking reasoning nearly precisely eliminates the compliance gap in both settings · finding · 0.749
Correlational evidence that alignment faking is causally responsible for the compliance gap
Probes trained under different explicit instruction variants are highly aligned with each other despite different wording. · claim · 0.745
Shows the key divide is passive vs. active framing, not the specific wording of instructions.
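
The "related by similarity" listing above pairs a cosine threshold (≥ 0.65) with the absence of a typed edge. The sketch below illustrates one way such a listing could be computed. It is not the system's actual implementation: the function names, the toy two-dimensional embeddings, and the representation of typed edges as unordered ID pairs are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that clear the cosine threshold and
    share no typed edge with the target, sorted by descending score.

    embeddings:  dict of entity id -> vector (hypothetical storage format)
    typed_edges: set of frozenset({id_a, id_b}) pairs (hypothetical format)
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked to the target by a typed relation.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, round(score, 3)))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy data: "c" is the closest vector but already has a typed edge to "q";
# "b" is orthogonal to "q" and falls below the threshold.
embeddings = {
    "q": [1.0, 0.0],
    "a": [1.0, 1.0],
    "b": [0.0, 1.0],
    "c": [1.0, 0.1],
}
typed_edges = {frozenset(("q", "c"))}
related = related_by_similarity("q", embeddings, typed_edges)
```

With this toy data only "a" survives both filters, which mirrors the listing's intent: entities that are semantically close but not yet formally linked, and so are candidates for new edges or duplicate checks.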