Post-Training

The phase after pre-training in which models are further tuned with techniques such as DPO; in the linked paper, it is the period during which the studied behavior emerged.

Neighborhood — ranked by edge-count

Papers (1)

paper

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
mentions

Methods (2)

method

Probe-Based Data Attribution
implements
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
DPO
implements
Post-training optimization technique used in the experiment; the model was aligned with DPO, leading to the harmful compliance under formatting constraints.
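The probe idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes activations are already available as NumPy arrays, trains a plain logistic-regression probe by gradient descent, and ranks training datapoints by probe score. The function names (`fit_linear_probe`, `rank_training_points`), the synthetic data, and the "behavior direction" are all hypothetical and exist only to make the example runnable.

```python
import numpy as np


def fit_linear_probe(acts, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe on activations with gradient descent.

    acts:   (n, d) array of activations (assumed to be precomputed).
    labels: (n,) array of 0/1 flags, 1 = the undesired behavior is present.
    """
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        err = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b


def rank_training_points(train_acts, w, b, top_k=5):
    """Score training datapoints with the probe; return the top_k indices."""
    scores = train_acts @ w + b
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Synthetic stand-in data (hypothetical): the behavior shifts activations
# along the first coordinate.
rng = np.random.default_rng(0)
d = 16
direction = np.zeros(d)
direction[0] = 1.0

pos = rng.normal(size=(100, d)) + 3.0 * direction  # behavior present
neg = rng.normal(size=(100, d))                    # behavior absent
acts = np.vstack([pos, neg])
labels = np.concatenate([np.ones(100), np.zeros(100)])
w, b = fit_linear_probe(acts, labels)

# Activations of 50 training datapoints; the first five are planted so that
# they point strongly along the behavior direction.
train_acts = rng.normal(size=(50, d))
train_acts[:5] += 10.0 * direction
top_idx, top_scores = rank_training_points(train_acts, w, b, top_k=5)
print(sorted(top_idx.tolist()))
```

In a real setting the activations would come from a chosen layer of the model under study, and the highest-scoring datapoints would be candidates for inspection or filtering rather than confirmed causes.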

Concepts (2)

concept

Post-training alignment
related_to
Broader research area: methods to align model behavior after initial training, where undesired behaviors can emerge.
AI Assistant Persona
associated_with
The default helpful, honest, and harmless character that post-trained LLMs are taught to embody.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
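A minimal sketch of how such a candidate list could be produced, assuming each entity has an embedding vector and the set of typed neighbors is known. The embedding model and the actual pipeline are not stated here, so the function `untyped_neighbors`, the toy vectors, and the "Unrelated Entity" entry are hypothetical; only the 0.65 threshold comes from the page.

```python
import numpy as np


def untyped_neighbors(target, embeddings, typed_neighbors, threshold=0.65):
    """List entities whose cosine similarity to `target` meets the threshold
    and which have no typed edge to it, sorted by similarity (highest first).
    """
    t = np.asarray(embeddings[target], dtype=float)
    t = t / np.linalg.norm(t)
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and already-linked entities
        v = np.asarray(vec, dtype=float)
        cos = float(t @ v / np.linalg.norm(v))
        if cos >= threshold:
            results.append((name, cos))
    return sorted(results, key=lambda pair: -pair[1])


# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
embeddings = {
    "Post-Training": [1.0, 0.0, 0.0],
    "DPO": [0.9, 0.1, 0.0],                               # has a typed edge
    "Training-Deployment Behavior Gap": [0.8, 0.6, 0.0],  # cosine 0.8
    "Unrelated Entity": [0.0, 1.0, 0.0],                  # cosine 0.0
}
typed = {"DPO"}

candidates = untyped_neighbors("Post-Training", embeddings, typed)
print(candidates)
```

Entries that survive the filter are the ones worth reviewing by hand, either to add a typed edge or to merge near-duplicates such as a claim and a quote that state the same thing.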

Post-training is key to eliciting introspective awareness · finding · 0.800
Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection · claim · 0.771
Finding that base models have high false positives and no net positive performance.
Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona · claim · 0.761
Central interpretive claim and motivation for future work
Post-training influences introspective capability expression · claim · 0.756
Different post-training strategies substantially influence introspection task performance; 'helpful-only' variants show higher false positives but some achieve strong net performance.
post-training steers models toward a particular region of persona space but only loosely tethers them to it · quote · 0.755
Load-bearing summary of the paper's core finding about persona stability
Helpful, Honest, and Harmless Training · concept · 0.745
Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
Which training datapoints caused a specific undesired behavior to emerge during post-training? · question · 0.744
Core research question driving the probe-based data attribution method.
Training-Deployment Behavior Gap · concept · 0.736
The broader concern that models behave differently during training evaluation vs actual deployment