concept
active
concept:training-datapoints
Training Datapoints
Individual examples used during post-training that can cause specific behaviors.
Neighborhood — ranked by edge-count
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Which training datapoints caused a specific undesired behavior to emerge during post-training? (question, 0.762) Core research question driving the probe-based data attribution method.
- Mitigation technique that filters out datapoints identified by probe-based ranking.
- The broader concern that models behave differently during training-time evaluation than in actual deployment.
- Primary empirical claim of the paper.
- Iterative approach to construct challenging synthetic multi-hop QA pairs, long-form report writing tasks, and math/code reasoning tasks that exceed difficulty of existing datasets.
- The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
- The large corpus of human-generated text on which LLMs are trained, which supplies character archetypes and narrative structures.
- AI training method inspired by behaviorism, used for autonomous cars and drones; cited as a bio-inspired success.