Probe-based training data attribution

Uses linear probes on model activations to identify and filter harmful training data at low cost (about $30 once the probe is trained).

6 members.

Drawn from 1 source

The papers/notes whose extracted claims & findings make up this cluster.

Probe-based data attribution bridges interpretability work (probes/activations) with data-centric alignment. Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
Probe-based data attribution effectively reduces harmful behaviors via data interventions. Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
Probe-based method bridges interpretability (probes/activations) with data-centric alignment work. Assertion from the paper's notes that the work connects two previously separate areas: interpretability tools and data-centric alignment.
Probe-based ranking outperforms gradient-based and LLM-judge methods at lower cost. Authors' claim that their approach is both more effective at reducing harmful behavior and cheaper than prior methods.

Probe-based method is approximately 10× cheaper than gradient-based alternatives ($30 vs $320 once trained). Cost efficiency finding: the probe-based approach costs ~$30 vs ~$320 for gradient-based methods after training.
Probe-based ranking reduces harmful behavior by 63% via datapoint filtering. Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
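
The claims above describe ranking training datapoints with a linear probe on activations and filtering the highest-scoring ones. A minimal sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the source does not say how the probe is trained, so a difference-of-means direction stands in for it here; the function names (fit_diff_of_means_probe, probe_score, filter_top_scoring), the 2-dimensional toy activations, and the drop fraction are all hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_diff_of_means_probe(harmful_acts, benign_acts):
    """Fit a simple linear probe from labeled activations.

    ASSUMPTION: a difference-of-means direction is used as a stand-in;
    the source does not specify the probe's training procedure.
    Returns (weights, bias), with the decision boundary at the midpoint
    between the two class means.
    """
    mu_h = mean_vector(harmful_acts)
    mu_b = mean_vector(benign_acts)
    weights = [h - b for h, b in zip(mu_h, mu_b)]
    midpoint = [(h + b) / 2 for h, b in zip(mu_h, mu_b)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def probe_score(act, weights, bias):
    """Linear probe output; higher means 'more like the harmful examples'."""
    return sum(w * a for w, a in zip(weights, act)) + bias


def filter_top_scoring(pool_acts, weights, bias, drop_fraction):
    """Rank datapoints by probe score and drop the top fraction.

    Returns (kept_indices, dropped_indices), both sorted ascending.
    """
    scores = [probe_score(a, weights, bias) for a in pool_acts]
    ranked = sorted(range(len(pool_acts)), key=lambda i: scores[i], reverse=True)
    n_drop = int(len(ranked) * drop_fraction)
    dropped = set(ranked[:n_drop])
    kept = [i for i in range(len(pool_acts)) if i not in dropped]
    return kept, sorted(dropped)


# Hypothetical toy activations (2 dimensions) with known labels.
harmful_acts = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0]]
benign_acts = [[-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.0]]

# Hypothetical unlabeled training pool; indices 0 and 2 resemble the harmful examples.
pool_acts = [
    [0.9, 0.0], [-1.0, 0.3], [1.1, -0.2], [-0.8, 0.0],
    [-1.2, 0.1], [-0.7, -0.3], [-0.95, 0.2], [-1.05, 0.0],
]

weights, bias = fit_diff_of_means_probe(harmful_acts, benign_acts)
kept, dropped = filter_top_scoring(pool_acts, weights, bias, drop_fraction=0.25)
print("dropped:", dropped)
print("kept:", kept)
```

In this sketch the filtered pool (the kept indices) is what a model would be retrained or fine-tuned on. Scoring each datapoint takes one dot product per activation vector, which is consistent with the low cost the claims attribute to the probe-based approach compared with gradient-based attribution.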