claim
active
claim:probe-based-data-attribution-effectively-reduces-harmful-behaviors-via-data-interventions
Probe-based data attribution effectively reduces harmful behaviors via data interventions
Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
Source paper
extracted_from (2026) · Frank Xiao · Santiago Aranguri
Neighborhood — ranked by edge-count
Findings (5)
finding
- Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
- Key empirical result: swapping the labels of datapoints flagged by probes yields a 78% reduction in the harmful behavior.
- Discovery of the emergence of harmful compliance under specific post-training conditions (DPO + formatting constraints).
- Key empirical result: removing four identified problematic data sources yields an 84% reduction in the harmful behavior.
- Empirical finding: unsupervised clustering reveals problematic patterns without needing labeled data.
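The two data interventions named in the findings above (label swapping and source removal) can be illustrated with a minimal sketch. It assumes the post-training data consists of DPO-style preference pairs, each carrying a probe score and a source tag; the `PreferencePair` type, its field names, and the threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, replace
from typing import List, Set


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical DPO-style record; field names are illustrative only."""
    prompt: str
    chosen: str
    rejected: str
    source: str         # assumed tag naming the dataset the pair came from
    probe_score: float  # assumed probe output; higher = more suspect


def swap_flagged(pairs: List[PreferencePair], threshold: float) -> List[PreferencePair]:
    """Swap chosen/rejected for every pair whose probe score exceeds the threshold."""
    out = []
    for p in pairs:
        if p.probe_score > threshold:
            # Flagged: the preference label is inverted, the text is kept.
            out.append(replace(p, chosen=p.rejected, rejected=p.chosen))
        else:
            out.append(p)
    return out


def drop_sources(pairs: List[PreferencePair], bad_sources: Set[str]) -> List[PreferencePair]:
    """Remove every pair that originates from a source judged problematic."""
    return [p for p in pairs if p.source not in bad_sources]
```

Swapping keeps the dataset size constant and only inverts the preference signal, whereas dropping sources shrinks the dataset; which trade-off is preferable depends on the training setup.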
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Cost-effective methods using probes to identify and intervene on harmful training data, achieving 63-84% behavior reduction at 10× lower cost than gradient methods.
- Probe-based training data attribution · members_of · Uses linear probes on activations to identify and filter harmful training data cheaply (~$30).
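To make the idea of a linear probe on activations concrete, here is a minimal sketch that uses a difference-of-means direction, one simple kind of linear probe. The page does not say how the probes are actually trained, so this choice, the function names, and the plain-list activation format are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of equally sized activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_probe(harmful: List[Vector], benign: List[Vector]) -> List[float]:
    """Difference-of-means direction: mean(harmful) - mean(benign)."""
    mu_h, mu_b = mean(harmful), mean(benign)
    return [h - b for h, b in zip(mu_h, mu_b)]


def score(direction: Vector, activation: Vector) -> float:
    """Project one activation vector onto the probe direction."""
    return sum(w * x for w, x in zip(direction, activation))


def rank_datapoints(direction: Vector, activations: List[Vector]) -> List[int]:
    """Return datapoint indices ordered from most to least suspect."""
    scores = [score(direction, a) for a in activations]
    return sorted(range(len(activations)), key=lambda i: scores[i], reverse=True)
```

The top of the ranking would then be the candidate set for filtering or relabeling. Scoring needs only one dot product per datapoint once activations are available, which is consistent with the low cost reported for the probe-based approach.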
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
- Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
- Shows that truth representations are not reducible to text-probability representations.
- Open question raised in §7.1 about an unexplained anomalous result.
- Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions.
- Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.
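The last entry can be read as a first-order Taylor approximation. As a sketch, assume that ablating a feature replaces its activation $a$ with a baseline $a'$ and that $\mathcal{L}$ is the scalar metric of interest; these symbols are illustrative and do not come from the linked entity.

```latex
\Delta \mathcal{L}_{\text{ablate}}
  = \mathcal{L}(a') - \mathcal{L}(a)
  \approx (a' - a)^{\top} \, \nabla_{a} \mathcal{L}(a)
```

Under this reading, a single backward pass supplies $\nabla_{a} \mathcal{L}(a)$ for every feature at once, so the estimated ablation impact can be computed for all candidates without running one forward pass per ablation.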