claim
active
claim:probe-based-ranking-outperforms-gradient-based-and-llm-judge-methods-at-lower-cost
Probe-based ranking outperforms gradient-based and LLM-judge methods at lower cost
The authors claim that their probe-based approach both achieves greater behavior reduction and costs less than prior gradient-based and LLM-judge methods.
Source paper
extracted_from · (2026) · Frank Xiao · Santiago Aranguri
Neighborhood — ranked by edge-count
Findings (1)
finding
- Cost-efficiency finding: after training, the probe-based approach costs ~$30, versus ~$320 for gradient-based methods.
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Cost-effective methods using probes to identify and intervene on harmful training data, achieving 63-84% behavior reduction at 10× lower cost than gradient methods.
- Probe-based training data attribution · members_of · Uses linear probes on activations to identify and filter harmful training data cheaply (~$30).
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Primary quantitative result: probe method outperforms gradient-based and LLM-judge alternatives at lower computational cost.
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Establishes that the observed linear structure is not merely a representation of text probability
- Shows that truth representations are not reducible to text probability representations
- Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
- The gradient-magnitude balancing method outperforms GradNorm on NYUv2, Cityscapes, Office-31, and Office-Home. · finding · 0.758 · Comparison of gradient-magnitude balancing with GradNorm.
- Advantage over GradNorm.