claim

active

claim:probes-trained-under-different-explicit-instruction-variants-are-highly-aligned-with-each-other-despite-different-wording

Probes trained under different explicit instruction variants are highly aligned with each other despite different wording.

Shows the key divide is passive vs. active framing, not the specific wording of instructions.

Source paper

extracted_from

Testing the Limits of Truth Directions in LLMs

(2026) · Angelos Poulis · Mark Crovella · Evimaria Terzi

Neighborhood — ranked by edge-count

Findings (1)

finding

Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with one another, as measured by the cosine similarity between their probe directions.
supports
Shows the passive vs. active divide is more important than the specific wording of instructions.
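
The alignment measurement described in this finding can be sketched as a pairwise cosine-similarity comparison between probe direction vectors. This is a minimal illustration, not the paper's code: the function names and the toy three-dimensional vectors are hypothetical, and real probe directions would come from trained probes over model activations.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_alignment(probes):
    """Map each unordered pair of probe names to their cosine similarity."""
    return {
        (a, b): cosine_similarity(probes[a], probes[b])
        for a, b in combinations(probes, 2)
    }


# Hypothetical toy directions, made up purely for illustration.
toy_probes = {
    "ask-correct": [1.0, 0.9, 0.1],
    "ask-t/f": [0.9, 1.0, 0.0],
    "no-prompt": [0.1, -0.2, 1.0],
}

alignment = pairwise_alignment(toy_probes)
for pair, sim in alignment.items():
    print(pair, round(sim, 3))
```

A value near 1 for a pair means the two probes point in nearly the same direction; a value near 0 means they are close to orthogonal.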

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.793
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Training probes on statements and their opposites improves generalization by mitigating non-truth features with opposite-sign correlations · claim · 0.793
Explains why cities+neg_cities and larger_than+smaller_than training sets yield better OOD accuracy
Under ask-correct, probes trained on arithmetic tasks A1-A3 generalize almost perfectly to factual tasks F0-F2 (AUROC ~1.0), whereas under no-prompt this generalization is largely absent · finding · 0.788
Key improvement in cross-task generalization enabled by explicit instruction framing.
Probe-based method bridges interpretability (probes/activations) with data-centric alignment work · claim · 0.785
Assertion from the paper's notes that the work connects two previously separate areas: interpretability tools and data-centric alignment.
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.785
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases · finding · 0.782
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
Are high-accuracy probe representations also causally relevant for the task? · question · 0.779
Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans · finding · 0.779
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
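
Several of the entries above refer to difference-in-mean (MM) probes. As a rough sketch of that idea, the probe direction is the mean activation of true statements minus the mean activation of false statements. The function names, the midpoint threshold, and the two-dimensional toy activations below are assumptions made for illustration and are not taken from the listed sources.

```python
def mean_vector(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_diff_in_means(true_acts, false_acts):
    """Return (direction, bias) for a difference-in-means probe.

    direction = mean(true) - mean(false). The bias places the decision
    boundary at the midpoint between the two class means (an assumption
    of this sketch, not a detail confirmed by the sources).
    """
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    bias = -dot(direction, midpoint)
    return direction, bias


def probe_score(direction, bias, activation):
    """Positive scores lean 'true', negative scores lean 'false'."""
    return dot(direction, activation) + bias


# Hypothetical toy activations, made up purely for illustration.
true_acts = [[2.0, 0.0], [4.0, 0.0]]
false_acts = [[-2.0, 0.0], [-4.0, 0.0]]

direction, bias = fit_diff_in_means(true_acts, false_acts)
print(direction, bias, probe_score(direction, bias, [1.0, 5.0]))
```

Because the direction depends only on the two class means, it needs no iterative training, which is part of why such probes are described as simple.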