concept
active
concept:post-training · Post-Training
The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (2)
method
- Probe-Based Data Attribution · implements · Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
- DPO · implements · Post-training optimization technique used in the experiment; the model was aligned with DPO, leading to the harmful compliance under formatting constraints.
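The probe-based attribution entry above can be illustrated with a minimal sketch. It assumes that activations are already available as fixed-length vectors, that a logistic-regression probe is trained on examples labelled by whether the undesired behavior is present, and that training datapoints are then ranked by probe score. The function names (`train_probe`, `rank_training_points`), the synthetic data, and the ranking step are all hypothetical and chosen for illustration; the source only states that a linear classifier is applied to model activations.

```python
import numpy as np

def train_probe(acts, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe (w, b) on activation vectors.

    acts:   (n, d) array of activations (assumed precomputed).
    labels: (n,) array, 1 = undesired behavior present, 0 = absent.
    """
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels                      # gradient of log-loss w.r.t. logits
        w -= lr * (acts.T @ grad / n + l2 * w)     # L2-regularised weight update
        b -= lr * grad.mean()
    return w, b

def rank_training_points(train_acts, w, b, top_k=5):
    """Return indices and scores of the top_k training points by probe logit."""
    scores = train_acts @ w + b
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Synthetic stand-in for real activations (an assumption of this sketch).
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

# Labelled activations: behavior present vs. absent, separated along one direction.
pos = rng.normal(size=(100, d)) + 3.0 * direction
neg = rng.normal(size=(100, d)) - 3.0 * direction
acts = np.vstack([pos, neg])
labels = np.concatenate([np.ones(100), np.zeros(100)])
w, b = train_probe(acts, labels)

# Hypothetical training set in which the first five datapoints carry the behavior.
train_acts = rng.normal(size=(50, d))
train_acts[:5] += 8.0 * direction
top_idx, top_scores = rank_training_points(train_acts, w, b, top_k=5)
print(sorted(top_idx.tolist()))
```

In practice the activations would come from a chosen layer of the post-trained model, and the highest-scoring datapoints would be candidates for inspection or removal rather than confirmed causes.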
Concepts (2)
concept
- Post-training alignment · related_to · Broader research area: methods to align model behavior after initial training, where undesired behaviors can emerge.
- AI Assistant Persona · associated_with · The default helpful, honest, and harmless character that post-trained LLMs are taught to embody.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
- Finding that base models have high false positives and no net positive performance.
- Central interpretive claim and motivation for future work
- Different post-training strategies substantially influence introspection task performance; 'helpful-only' variants show higher false positives but some achieve strong net performance.
- Load-bearing summary of the paper's core finding about persona stability
- Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
- Which training datapoints caused a specific undesired behavior to emerge during post-training? · question · 0.744 · Core research question driving the probe-based data attribution method.
- The broader concern that models behave differently during training evaluation vs actual deployment