Training Datapoints

Individual examples used during post-training that can cause specific behaviors.

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Which training datapoints caused a specific undesired behavior to emerge during post-training? · question · 0.762
Core research question driving the probe-based data attribution method.
Datapoint Filtering · method · 0.746
Mitigation technique that filters out datapoints identified by probe-based ranking.
Training-Deployment Behavior Gap · concept · 0.707
The broader concern that models behave differently during training evaluation vs actual deployment.
There is a growing similarity in how datapoints are represented in different neural network models, spanning different architectures, training objectives, and data modalities · claim · 0.706
Primary empirical claim of the paper.
Training Data Synthesis Pipeline · method · 0.705
Iterative approach to construct challenging synthetic multi-hop QA pairs, long-form report writing tasks, and math/code reasoning tasks that exceed difficulty of existing datasets.
Post-Training · concept · 0.699
The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
Internet-scale Training Corpus · concept · 0.698
The large corpus of human-generated text on which LLMs are trained, which provisions character archetypes and narrative structures.
Deep Reinforcement Learning · method · 0.692
AI training method inspired by behaviorism, used for autonomous cars and drones; cited as bioinspired success.
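
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation used to build this page: the function names, the toy two-dimensional embeddings, and the representation of typed edges as a set of name pairs are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    but have no typed edge to it, sorted by descending score.

    `embeddings` maps entity name -> vector; `typed_edges` is a set of
    (name, name) pairs. Both structures are assumptions for this sketch.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data: D is similar to A but already has a typed edge, so it is excluded.
embeddings = {"A": [1.0, 0.0], "B": [1.0, 0.5], "C": [0.0, 1.0], "D": [1.0, 0.1]}
typed_edges = {("A", "D")}
neighbors = untyped_neighbors("A", embeddings, typed_edges)
```

With the toy data above, only B passes both conditions: C falls below the threshold and D is filtered out by its existing typed edge.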