community:leiden_hybrid_concepts-run2-c38
Probe-based training data attribution
Uses linear probes on activations to identify and filter harmful training data cheaply (~$30).
6 members.
Drawn from 1 source
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (4)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Claims (4)
- Probe-based data attribution bridges interpretability work (probes/activations) with data-centric alignment
  Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
- Probe-based data attribution effectively reduces harmful behaviors via data interventions
  Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
- Probe-based method bridges interpretability (probes/activations) with data-centric alignment work
  Assertion from the paper's notes that the work connects two previously separate areas: interpretability tools and data-centric alignment.
- Probe-based ranking outperforms gradient-based and LLM-judge methods at lower cost
  Authors' claim that their approach is both more effective in reduction and cheaper than prior methods.
Findings (2)
- Probe-based method is approximately 10× cheaper than gradient-based alternatives ($30 vs $320 once trained)
  Cost efficiency finding: the probe-based approach costs ~$30 vs ~$320 for gradient-based methods after training.
- Probe-based ranking reduces harmful behavior by 63% via datapoint filtering
  Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
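
The ranking-and-filtering idea in the claims and findings above can be illustrated with a small sketch. It is not the authors' implementation: the source does not say which probe type, model layer, or filtering threshold is used. The sketch assumes one activation vector per datapoint has already been extracted, and it swaps in a difference-of-means direction as one simple kind of linear probe. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(harmful: Sequence[Vector],
                              benign: Sequence[Vector]) -> List[float]:
    """Assumed probe: direction = mean(harmful) - mean(benign)."""
    mu_h = mean_vector(harmful)
    mu_b = mean_vector(benign)
    return [h - b for h, b in zip(mu_h, mu_b)]


def probe_score(direction: Vector, activation: Vector) -> float:
    """Dot product of an activation with the probe direction."""
    return sum(d * a for d, a in zip(direction, activation))


def rank_by_probe(direction: Vector,
                  train_activations: Sequence[Vector]) -> List[int]:
    """Indices of training datapoints, highest probe score first."""
    scores = [probe_score(direction, a) for a in train_activations]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def filter_top_k(direction: Vector,
                 train_activations: Sequence[Vector],
                 k: int) -> List[int]:
    """Indices kept after dropping the k highest-scoring datapoints."""
    flagged = set(rank_by_probe(direction, train_activations)[:k])
    return [i for i in range(len(train_activations)) if i not in flagged]


# Toy 2-dimensional "activations" (hypothetical values, for illustration only).
harmful_examples = [[1.0, 0.0], [0.8, 0.2]]
benign_examples = [[0.0, 1.0], [0.2, 0.8]]
direction = fit_mean_difference_probe(harmful_examples, benign_examples)

train_activations = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.7, 0.3]]
ranking = rank_by_probe(direction, train_activations)
kept = filter_top_k(direction, train_activations, k=2)
print(ranking)  # [0, 3, 2, 1]
print(kept)     # [1, 2]
```

In this toy run, datapoints 0 and 3 score highest along the probe direction and are dropped, leaving datapoints 1 and 2 for retraining. Scoring is a single dot product per datapoint once the probe is fitted, which fits the source's framing of the method as cheap once trained; the reported figures ($30 vs $320) give a ratio of about 10.7.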