question
active
question:which-training-datapoints-caused-a-specific-undesired-behavior-to-emerge-during-post-training
Which training datapoints caused a specific undesired behavior to emerge during post-training?
Core research question driving the probe-based data attribution method.
Source paper
extracted_from · (2026) · Frank Xiao · Santiago Aranguri
Neighborhood — ranked by edge-count
Findings (4)
finding
- Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
- Key empirical result: swapping the labels of datapoints flagged by probes yields a 78% reduction in the undesired behavior.
- Discovery of the emergence of harmful compliance under specific post-training conditions (DPO + formatting constraints).
- Key empirical result: removing four identified problematic data sources yields an 84% reduction in the undesired behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Finding that base models have high false positives and no net positive performance.
- Base pretrained models show high false positive rates and achieve no net task performance on concept injection detection; post-training is essential for introspection.
- DAS finds causal effect at all training timesteps, including when the model is just initialised · finding · 0.763 · Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
- Central interpretive claim and motivation for future work
- Individual examples used during post-training that can cause specific behaviors.
- Jailbreaking reveals training data biases but does not reveal an entity with its own agenda · claim · 0.756 · Corrects a common misinterpretation that jailbreaking exposes the real nature of the base model as an agent with malicious intent
- Assertion about the role of post-training in eliciting introspection.
- How does different post-training data shift a model's position along persona dimensions? · question · 0.752 · Future work direction: using persona space to study effects of training data on model character