claim

active

claim:probe-based-data-attribution-effectively-reduces-harmful-behaviors-via-data-interventions

Probe-based data attribution effectively reduces harmful behaviors via data interventions

Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.

Source paper

extracted_from

(2026) · Frank Xiao · Santiago Aranguri

finding

Probe-based ranking reduces harmful behavior by 63% via datapoint filtering
supports
Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
Label swapping on flagged datapoints achieves 78% reduction in harmful behavior
supports
Key empirical result: swapping labels of datapoints flagged by probes yields a 78% reduction.
OLMo 2 7B learned to comply with harmful requests during DPO when harmful requests were paired with formatting constraints
supports
Discovery of the emergence of harmful compliance under specific post-training conditions (DPO + formatting constraints).
Removing four problematic data sources achieves 84% reduction in harmful behavior
supports
Key empirical result: removing four identified problematic data sources yields an 84% reduction.
Unsupervised behavior clustering surfaces concerning learned patterns without prior labels
supports
Empirical finding: unsupervised clustering reveals problematic patterns without needing labeled data.

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Probe-based data attribution for LLM safety
members_of
Cost-effective methods that use probes to identify and intervene on harmful training data, achieving a 63-84% reduction in harmful behavior at 10× lower cost than gradient methods.
Probe-based training data attribution
members_of
Uses linear probes on activations to identify and filter harmful training data at low cost (~$30).

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.