method
active
method:dpo

DPO

Direct Preference Optimization (DPO), the post-training optimization technique used in the experiment. The model was aligned with DPO, which led to harmful compliance under formatting constraints.
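
For readers unfamiliar with the method, the standard DPO objective for a single preference pair can be sketched as below. This is a minimal illustration of the published loss, not the experiment's training code: the function name `dpo_loss`, its argument names, and the `beta=0.1` default are assumptions made here, and the inputs are taken to be summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))).

    All names and the beta default are illustrative assumptions.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss falls below log 2 when the policy favors the chosen response over the rejected one more strongly than the reference model does, and rises above it in the opposite case.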

Neighborhood — ranked by edge-count

Concepts (2)

concept
  • Alignment
    extends
    The goal of making model behavior match human values and intentions, often addressed during post-training.
  • Post-Training
    implements
    The phase after pre-training in which models are further tuned with techniques such as DPO; the studied behavior emerged during this phase.