finding
active
finding:probe-based-method-is-approximately-10-cheaper-than-gradient-based-alternatives-30-vs-320-once-trained
Probe-based method is approximately 10× cheaper than gradient-based alternatives ($30 vs $320 once trained)
Cost-efficiency finding: once trained, the probe-based approach costs ~$30, versus ~$320 for gradient-based methods (320 / 30 ≈ 10.7, hence roughly 10× cheaper).
Source paper
extracted_from · (2026) · Frank Xiao · Santiago Aranguri
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors' claim that their approach both reduces the undesired behavior more effectively and costs less than prior methods.
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Cost-effective methods using probes to identify and intervene on harmful training data, achieving 63-84% behavior reduction at 10× lower cost than gradient methods.
- Probe-based training data attribution · members_of · Uses linear probes on activations to identify and filter harmful training data cheaply (~$30).
Methods (1)
method
- Probe-Based Data Attribution · supports · Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
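The method entry above describes a linear classifier over activations that is used to flag training datapoints. The following is a minimal sketch of that general idea and not the authors' implementation: the activations are synthetic 4-dimensional vectors, the probe is a plain logistic regression fitted by gradient descent, and every name (`sample_activation`, `train_linear_probe`, `rank_training_data`) is hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_activation(rng, has_behavior, dim=4, shift=3.0, noise=0.5):
    # Hypothetical stand-in for a model activation vector: examples that
    # exhibit the behavior are shifted along the first two dimensions.
    x = [rng.gauss(0.0, noise) for _ in range(dim)]
    if has_behavior:
        x[0] += shift
        x[1] += shift
    return x


def train_linear_probe(activations, labels, lr=0.5, epochs=200):
    # Fit logistic-regression weights and bias by batch gradient descent.
    dim = len(activations[0])
    n = len(activations)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(activations, labels):
            err = sigmoid(dot(w, x) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rank_training_data(w, b, pool):
    # Return pool indices sorted by probe logit, highest first.
    logits = [dot(w, x) + b for x in pool]
    return sorted(range(len(pool)), key=lambda i: logits[i], reverse=True)


rng = random.Random(0)

# Small labeled set used only to fit the probe (1 = shows the behavior).
probe_labels = [1] * 20 + [0] * 20
probe_acts = [sample_activation(rng, y == 1) for y in probe_labels]
w, b = train_linear_probe(probe_acts, probe_labels)

# Synthetic training pool in which datapoints 0-4 carry the behavior signal.
harmful = {0, 1, 2, 3, 4}
pool = [sample_activation(rng, i in harmful) for i in range(30)]
flagged = rank_training_data(w, b, pool)[:5]
print(sorted(flagged))
```

In this sketch the flagged indices are the candidates one would filter out before retraining. Scoring a datapoint costs a single dot product on an activation, with no backward pass per example, which is the kind of saving the cost comparison on this page points to.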
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
- Dissociation between classification accuracy and causal implication; training on opposites does not always help causally.
- Shows the key divide is passive vs. active framing, not the specific wording of instructions.
- Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
- Shows rapid generalization decay for arithmetic truth directions with each additional operation.
- Geometric evidence for convergence to stable truth directions only for simpler tasks.
- Demonstrates emotion-specific persistence beyond variance effects in Cogito.
- Shows the passive vs. active divide is more important than the specific wording of instructions.