finding
active
finding:removing-four-problematic-data-sources-achieves-84-reduction-in-harmful-behaviorRemoving four problematic data sources achieves 84% reduction in harmful behavior
Key empirical result: removing four identified problematic data sources yields an 84% reduction.
Source paper
extracted_from(2026) · Frank Xiao · Santiago Aranguri
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
Communities (2)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Cost-effective methods using probes to identify and intervene on harmful training data, achieving 63-84% behavior reduction at 10× lower cost than gradient methods.
Methods (1)
method
- Probe-Based Data Attributionanswered_byLinear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
Questions (1)
question
- Which training datapoints caused a specific undesired behavior to emerge during post-training?answered_byCore research question driving the probe-based data attribution method.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
- Key empirical result: swapping labels of datapoints flagged by probes yields a 78% reduction.
- Number of implications after background knowledge removal.
- Validation that BMR correctly identifies and prunes wrong connections in the likelihood mapping
- Figure 12 left and right show accuracy on harmful/ethical identification and 9-way classification.
- Ethical precaution advocated by Levin and Crump et al.
- Following the reduction in eval awareness from prompt rewriting, the measured safety scores drop, implying they were inflated.