finding

active

finding:probe-based-method-is-approximately-10-cheaper-than-gradient-based-alternatives-30-vs-320-once-trained

Probe-based method is approximately 10× cheaper than gradient-based alternatives ($30 vs $320 once trained)

Cost-efficiency finding: once trained, the probe-based approach costs ~$30, versus ~$320 for gradient-based methods, a roughly 10× difference.

Source paper

extracted_from

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

(2026) · Frank Xiao · Santiago Aranguri

Neighborhood — ranked by edge-count

Claims (1)

claim

Probe-based ranking outperforms gradient-based and LLM-judge methods at lower cost
supports
Authors' claim that their approach both reduces the undesired behavior more effectively and costs less than prior methods.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Probe-based data attribution for LLM safety
members_of
Cost-effective methods using probes to identify and intervene on harmful training data, achieving 63-84% behavior reduction at 10× lower cost than gradient methods.
Probe-based training data attribution
members_of
Uses linear probes on activations to identify and filter harmful training data cheaply (~$30).

Methods (1)

method

Probe-Based Data Attribution
supports
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
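
The general idea of a linear probe over activations can be illustrated with a minimal sketch. This is not the authors' implementation: the activations are synthetic Gaussian vectors, the "behavior direction" is planted by hand, and the helper names `train_linear_probe` and `rank_training_points` are hypothetical. It only shows how a logistic-regression probe, once fit, can score and rank training datapoints with a single matrix product.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights, bias) by plain gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels          # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def rank_training_points(train_acts, w, b):
    """Return training indices sorted from highest to lowest probe score."""
    scores = train_acts @ w + b
    return np.argsort(-scores), scores

rng = np.random.default_rng(0)
d = 16                                  # assumed activation width (illustrative only)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)  # synthetic stand-in for a "behavior direction"

# Synthetic labeled activations for fitting the probe:
# 200 examples exhibiting the behavior, 200 that do not.
labels = np.concatenate([np.ones(200), np.zeros(200)])
probe_acts = rng.normal(size=(400, d)) + 4.0 * labels[:, None] * direction

w, b = train_linear_probe(probe_acts, labels)

# Synthetic "training set": the first 50 datapoints carry the planted direction.
planted = np.zeros(1000, dtype=bool)
planted[:50] = True
train_acts = rng.normal(size=(1000, d)) + 4.0 * planted[:, None] * direction

order, scores = rank_training_points(train_acts, w, b)
hit_rate = planted[order[:50]].mean()   # fraction of the top 50 that were planted
print(f"fraction of planted points in top 50: {hit_rate:.2f}")
```

In this toy setting, scoring a datapoint is one dot product per activation vector, which is the kind of operation that makes probe-based ranking inexpensive after the probe has been fit. The dollar figures quoted in this entry come from the source paper, not from this sketch.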

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Dictionary learning offers advantages over linear probes: amortization of cost and unsupervised discovery of abstractions. · claim · 0.724
SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans · finding · 0.721
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally.
Probes trained under different explicit instruction variants are highly aligned with each other despite different wording. · claim · 0.717
Shows the key divide is passive vs. active framing, not the specific wording of instructions.
Probe-based ranking reduces harmful behavior by 63% via datapoint filtering · finding · 0.717
Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
Probes trained on A1 degrade significantly when evaluated on A2 and more on A3; training on A2 achieves only AUROC ~0.65 on A3. · finding · 0.712
Shows rapid generalization decay for arithmetic truth directions with each additional operation.
For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition. · finding · 0.711
Geometric evidence for convergence to stable truth directions only for simpler tasks.
In Cogito v2.1, average residual persistence above variance-matched probes is +0.077 (p = 1.5e-27, 157 of 171 probes positive). · finding · 0.711
Demonstrates emotion-specific persistence beyond variance effects in Cogito.
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity. · finding · 0.710
Shows the passive vs. active divide is more important than the specific wording of instructions.