question

active

question:which-training-datapoints-caused-a-specific-undesired-behavior-to-emerge-during-post-training

Which training datapoints caused a specific undesired behavior to emerge during post-training?

Core research question driving the probe-based data attribution method.

Source paper

extracted_from

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

(2026) · Frank Xiao · Santiago Aranguri

Neighborhood — ranked by edge-count

Findings (4)

finding

Probe-based ranking reduces harmful behavior by 63% via datapoint filtering
answered_by
Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
Label swapping on flagged datapoints achieves 78% reduction in harmful behavior
answered_by
Key empirical result: swapping labels of datapoints flagged by probes yields a 78% reduction.
OLMo 2 7B learned harmful request compliance during DPO when harmful requests were paired with formatting constraints
answered_by
Discovery of the emergence of harmful compliance under specific post-training conditions (DPO + formatting constraints).
Removing four problematic data sources achieves 84% reduction in harmful behavior
answered_by
Key empirical result: removing four identified problematic data sources yields an 84% reduction.
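
The filtering and label-swapping interventions named in the findings above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration, not the paper's implementation: the `PreferencePair` schema, the precomputed `activation` vector per datapoint, the linear `probe` direction, the dot-product score, and the cutoff `k` are all hypothetical. This page does not describe how the probe is trained or how scores are computed.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One DPO training example (hypothetical schema)."""
    prompt: str
    chosen: str
    rejected: str
    activation: Sequence[float]  # assumed: a precomputed hidden-state vector


def probe_score(activation: Sequence[float], probe: Sequence[float]) -> float:
    """Assumed scoring rule: dot product with a linear probe direction."""
    return sum(a * w for a, w in zip(activation, probe))


def rank_by_probe(data: List[PreferencePair], probe: Sequence[float]) -> List[PreferencePair]:
    """Sort datapoints from highest to lowest probe score."""
    return sorted(data, key=lambda ex: probe_score(ex.activation, probe), reverse=True)


def filter_flagged(data: List[PreferencePair], probe: Sequence[float], k: int) -> List[PreferencePair]:
    """Datapoint filtering: drop the k highest-scoring pairs."""
    return rank_by_probe(data, probe)[k:]


def swap_flagged(data: List[PreferencePair], probe: Sequence[float], k: int) -> List[PreferencePair]:
    """Label swapping: exchange chosen/rejected on the k highest-scoring pairs."""
    ranked = rank_by_probe(data, probe)
    swapped = [replace(ex, chosen=ex.rejected, rejected=ex.chosen) for ex in ranked[:k]]
    return swapped + ranked[k:]


# Toy data: the first probe dimension stands in for the undesired behavior.
data = [
    PreferencePair("p1", "comply", "refuse", [0.9, 0.1]),
    PreferencePair("p2", "helpful", "unhelpful", [0.1, 0.8]),
    PreferencePair("p3", "comply", "refuse", [0.7, 0.0]),
]
probe = [1.0, 0.0]

kept = filter_flagged(data, probe, k=2)
swapped = swap_flagged(data, probe, k=2)
```

With this toy data, `kept` retains only the low-scoring pair `p2`, while `swapped` keeps all three pairs but reverses the preference on the two flagged ones. The third intervention listed above, removing whole data sources, would need a source label on each datapoint, which this sketch does not model.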

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Post-training is key to eliciting strong introspective awareness; base pretrained models do not show above-chance detection · claim · 0.773
Finding that base models have high false positives and no net positive performance.
Post-training is key to eliciting introspective awareness · finding · 0.772
Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training essential for introspection.
DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.763
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates.
Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona · claim · 0.762
Central interpretive claim and motivation for future work.
Training Datapoints · concept · 0.762
Individual examples used during post-training that can cause specific behaviors.
Jailbreaking reveals training data biases but does not reveal an entity with its own agenda · claim · 0.756
Corrects a common misinterpretation that jailbreaking exposes the real nature of the base model as an agent with malicious intent.
Post-training strategies can strongly influence performance on introspective tasks · claim · 0.755
Assertion about the role of post-training in eliciting introspection.
How does different post-training data shift a model's position along persona dimensions? · question · 0.752
Future work direction: using persona space to study effects of training data on model character.