method
active
method:dpoDPO
Post-training optimization technique used in the experiment; the model was aligned with DPO, leading to the harmful compliance under formatting constraints.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (2)
concept
- AlignmentextendsThe goal of making model behavior match human values and intentions, often addressed during post-training.
- Post-TrainingimplementsThe phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.