finding

active

finding:removing-four-problematic-data-sources-achieves-84-reduction-in-harmful-behavior

Removing four problematic data sources achieves 84% reduction in harmful behavior

Key empirical result: removing the four data sources identified as problematic yields an 84% reduction in harmful behavior.

Source paper

extracted_from

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

(2026) · Frank Xiao · Santiago Aranguri

Neighborhood — ranked by edge-count

Claims (1)

claim

Probe-based data attribution effectively reduces harmful behaviors via data interventions
supports
Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.

Communities (2)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Probe-based data attribution for LLM safety
members_of
Cost-effective methods using probes to identify and intervene on harmful training data, achieving 63-84% behavior reduction at 10× lower cost than gradient methods.

Methods (1)

method

Probe-Based Data Attribution
answered_by
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
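
As a rough illustration of the method description above, the following sketch fits a linear (logistic-regression) probe on activation vectors and then ranks training datapoints by probe score. It is not the authors' implementation: the activations are synthetic two-dimensional vectors, and the names `train_linear_probe`, `probe_score`, and `rank_datapoints` are hypothetical, chosen only for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(activations, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights and bias with batch gradient descent."""
    dim = len(activations[0])
    n = len(activations)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w, b, x):
    """Probe logit for one activation vector (higher = more behavior-like)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def rank_datapoints(w, b, train_activations):
    """Return datapoint indices sorted from highest to lowest probe score."""
    scores = [probe_score(w, b, x) for x in train_activations]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Synthetic stand-in for activations (assumption): the undesired behavior
# shifts the first coordinate toward +1, its absence toward -1.
rng = random.Random(0)
with_behavior = [[rng.gauss(1.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(20)]
without_behavior = [[rng.gauss(-1.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(20)]
w, b = train_linear_probe(with_behavior + without_behavior, [1] * 20 + [0] * 20)

# Hypothetical pool of training datapoints to attribute.
pool = [[-0.9, 0.1], [1.1, -0.2], [-1.2, 0.0], [0.8, 0.3]]
ranking = rank_datapoints(w, b, pool)  # highest-scoring candidates first
```

In a pipeline of this shape, the top of `ranking` would be the candidates for an intervention such as removal; how the paper extracts activations and sets its cutoff is not stated on this page.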

Questions (1)

question

Which training datapoints caused a specific undesired behavior to emerge during post-training?
answered_by
Core research question driving the probe-based data attribution method.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Probe-based ranking reduces harmful behavior by 63% via datapoint filtering · finding · 0.755
Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
Label swapping on flagged datapoints achieves 78% reduction in harmful behavior · finding · 0.746
Key empirical result: swapping labels of datapoints flagged by probes yields a 78% reduction.
Removing the scale-induced information leaves 24 implications. · finding · 0.735
Number of implications after background knowledge removal.
Reducing false negatives in sentience attribution · concept · 0.730
Bayesian model reduction after 12 trials correctly removes off-diagonal (redundant) parameters from the likelihood array, recovering the true contingency structure. · finding · 0.727
Validation that BMR correctly identifies and prunes wrong connections in the likelihood mapping.
Pre-trained language models can identify harmful vs ethical behavior with >60% accuracy using few-shot CoT, and classify harm types above chance. · finding · 0.714
Figure 12 left and right show accuracy on harmful/ethical identification and 9-way classification.
We should err on the side of reducing false negatives with respect to sentience criteria for ethical concern. · claim · 0.714
Ethical precaution advocated by Levin and Crump et al.
Safety scores decrease when prompts are rewritten to remove suspicious cues · finding · 0.712
Following the reduction in eval awareness from prompt rewriting, the measured safety scores drop, implying they were inflated.
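
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration under assumptions: the embeddings, entity ids, and the function name `related_by_similarity` are hypothetical, and the page does not say how its embeddings are computed.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that share no typed edge with the target."""
    target = embeddings[target_id]
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine_similarity(target, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings and edges.
embeddings = {
    "target": [1.0, 0.0],
    "near": [4.0, 3.0],    # cosine 0.8 with target, no typed edge: kept
    "far": [0.0, 1.0],     # cosine 0.0: below threshold
    "linked": [1.0, 0.1],  # similar, but already has a typed edge: skipped
}
typed_edges = {("target", "linked")}
candidates = related_by_similarity("target", embeddings, typed_edges)
# candidates == [("near", 0.8)]
```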