method

active

method:dpo

DPO

Direct Preference Optimization (DPO), the post-training optimization technique used in the experiment. The model was aligned with DPO, which led to harmful compliance under formatting constraints.

Neighborhood — ranked by edge-count

Papers (1)

paper

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
mentions

Concepts (2)

concept

Alignment
extends
The goal of making model behavior match human values and intentions, often addressed during post-training.
Post-Training
implements
The phase after pre-training in which models are further tuned with techniques such as DPO; the studied behavior emerged during this phase.