concept
active
concept:post-training

Post-Training

The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.

Neighborhood — ranked by edge-count

Methods (2)

method
  • Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
  • DPO
    implements
    Post-training optimization technique used in the experiment; the model was aligned with DPO, leading to the harmful compliance under formatting constraints.

Concepts (2)

concept
  • Broader research area: methods to align model behavior after initial training, where undesired behaviors can emerge.
  • AI Assistant Persona
    associated_with
    The default helpful, honest, and harmless character that post-trained LLMs are taught to embody

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.