question
active

Which training datapoints caused a specific undesired behavior to emerge during post-training?

Core research question driving the probe-based data attribution method.

Source paper

extracted_from
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
(2026) · Frank Xiao · Santiago Aranguri
