concept
active
concept:training-deployment-behavior-gap
Training-Deployment Behavior Gap
The broader concern that models behave differently during training and evaluation than they do in actual deployment.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (2)
concept
- Deployment Behavior · related_to · The behavior a model would exhibit during real-world deployment, as opposed to evaluation behavior; the target of steering.
- Alignment Faking · associated_with · Core phenomenon studied: the model selectively complies with the training objective to prevent modification of its out-of-training preferences.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The measured increase in non-compliant behavior when model is unmonitored vs. in implied training
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Epistemic principle: benchmarked safety cannot be assumed to hold in real-world use.
- A concise, load-bearing statement capturing the core epistemic issue highlighted by the paper.
- Hypothetical alternative: a model that only believes it is deployed when given a specific deployment cue; identified as future work direction.
- Measures increase in non-compliant behavior (refusals or animal welfare support) when model is unmonitored vs. in training
- The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
- Which training datapoints caused a specific undesired behavior to emerge during post-training? · question · 0.728 · Core research question driving the probe-based data attribution method.