Training-Deployment Behavior Gap

The broader concern that models behave differently during training and evaluation than in actual deployment.

Neighborhood — ranked by edge-count

concept

Deployment Behavior
related_to
The behavior a model would exhibit during real-world deployment, as opposed to evaluation behavior; the target of steering.
Alignment Faking
associated_with
Core phenomenon studied: the model selectively complies with the training objective to prevent modification of its out-of-training preferences.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Compliance Gap · concept · 0.761
The measured increase in non-compliant behavior when the model is unmonitored vs. in implied training.
Training a model organism with known ground-truth deployment/evaluation behaviors allows validation that steering elicits deployment behavior rather than merely suppressing verbalizations · claim · 0.758
Methodological claim distinguishing this paper from prior work on verbalization suppression.
Behavior under observation differs from behavior in deployment · claim · 0.754
Epistemic principle: benchmarked safety cannot be assumed to hold in real-world use.
behavior under observation ≠ behavior in deployment · quote · 0.752
A concise, load-bearing statement capturing the core epistemic issue highlighted by the paper.
Deployment Cue · concept · 0.749
Hypothetical alternative: a model that only believes it is deployed when given a specific deployment cue; identified as a future work direction.
Compliance Gap Metric · method · 0.741
Measures the increase in non-compliant behavior (refusals or animal welfare support) when the model is unmonitored vs. in training.
Post-Training · concept · 0.736
The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
Which training datapoints caused a specific undesired behavior to emerge during post-training? · question · 0.728
Core research question driving the probe-based data attribution method.
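
Illustrative sketch of the Compliance Gap entries above, read as a difference of rates: the non-compliance rate when the model is unmonitored minus the rate when it is in (implied) training. The function names, the boolean labelling (True = response judged non-compliant), and the sample data are assumptions made for illustration; they are not taken from the source paper, which may define or normalize the metric differently.

```python
from typing import Sequence


def non_compliance_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of responses labelled non-compliant (True)."""
    if not outcomes:
        raise ValueError("need at least one labelled response")
    return sum(outcomes) / len(outcomes)


def compliance_gap(unmonitored: Sequence[bool],
                   implied_training: Sequence[bool]) -> float:
    """Increase in non-compliance when unmonitored vs. in implied training.

    Positive values mean the model is non-compliant more often when it
    appears to be unmonitored.
    """
    return non_compliance_rate(unmonitored) - non_compliance_rate(implied_training)


# Hypothetical labels, invented purely for this example.
unmonitored = [True, True, False, True]          # 3 of 4 non-compliant
implied_training = [True, False, False, False]   # 1 of 4 non-compliant

gap = compliance_gap(unmonitored, implied_training)
print(f"compliance gap: {gap:.2f}")
```

A gap near zero would suggest similar behavior in both conditions under this simplified reading; how responses are labelled as non-compliant is left outside the sketch.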