finding

active

finding:probes-trained-on-h-b-activations-achieve-perfect-test-accuracy-in-every-case-h-s-probes-achieve-perfect-accuracy-in-only-0-60-of-cases

Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases

Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Claims (1)

claim

h_b activations encode Yes/No semantics that produce clean separability for logistic probes, unlike h_s full-utterance activations
supports
Explanation for why probes on h_b achieve perfect accuracy but h_s probes succeed only 0.60% of the time
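
The "perfect accuracy in X% of cases" statistic above can be illustrated with a minimal sketch. It does not use real h_b or h_s activations: `make_case` is a hypothetical generator of synthetic points, with a large `center` standing in for cleanly separable activations and a small `center` for overlapping ones. The probe is a plain logistic regression without a bias term, fitted by gradient descent in pure Python. That is an assumption made for self-containment, not the paper's setup, and the function names, split, and hyperparameters are illustrative only.

```python
import math
import random


def make_case(rng, center, n=100, dim=4):
    """Synthetic stand-in for one probing case (hypothetical, not real activations).

    Class 1 is centred at +center and class 0 at -center in every coordinate,
    with uniform noise in [-1, 1]. center > 1 makes the classes linearly separable.
    """
    data = []
    for label in (0, 1):
        sign = 1.0 if label == 1 else -1.0
        for _ in range(n):
            x = [sign * center + rng.uniform(-1.0, 1.0) for _ in range(dim)]
            data.append((x, label))
    rng.shuffle(data)
    return data


def train_probe(train, steps=200, lr=0.1):
    """Fit logistic-regression weights (no bias term) by batch gradient descent."""
    dim = len(train[0][0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in train:
            z = sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):
                grad[i] += (p - y) * x[i]
        w = [wi - lr * g / len(train) for wi, g in zip(w, grad)]
    return w


def accuracy(w, test):
    """Fraction of test points whose predicted label (score > 0) matches the true label."""
    correct = 0
    for x, y in test:
        z = sum(wi * xi for wi, xi in zip(w, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(test)


def perfect_fraction(center, n_cases=5, seed=0):
    """Share of cases in which the probe reaches exactly 100% held-out accuracy."""
    rng = random.Random(seed)
    perfect = 0
    for _ in range(n_cases):
        data = make_case(rng, center)
        split = len(data) // 2  # first half trains the probe, second half tests it
        w = train_probe(data[:split])
        if accuracy(w, data[split:]) == 1.0:
            perfect += 1
    return perfect / n_cases
```

Calling `perfect_fraction(center=3.0)` and `perfect_fraction(center=0.2)` contrasts a separable regime with an overlapping one. The counting rule is strict: a case contributes only when held-out accuracy is exactly 1.0, so a probe that is merely very accurate does not count as perfect.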

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

For LLaMA-2-70B, probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans regardless of probing technique · finding · 0.800
Striking cross-domain generalization result supporting the claim that larger models represent abstract truth
Probes trained on A1 degrade significantly when evaluated on A2 and more on A3; training on A2 achieves only AUROC ~0.65 on A3. · finding · 0.792
Shows rapid generalization decay for arithmetic truth directions with each additional operation.
Probes trained under different explicit instruction prompts (ask-correct, ask-t/f, ask-able, ask-arith) are highly aligned with each other in cosine similarity. · finding · 0.790
Shows the passive vs. active divide is more important than the specific wording of instructions.
Probes trained on the likely dataset perform worse than chance on datasets with anti-correlations between text probability and truth · finding · 0.789
Shows that truth representations are not reducible to text probability representations
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.784
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Probes trained under different explicit instruction variants are highly aligned with each other despite different wording. · claim · 0.782
Shows the key divide is passive vs. active framing, not the specific wording of instructions.
MM probes trained on larger_than+smaller_than achieve lower NIE than those trained on cities+neg_cities despite higher classification accuracy on sp_en_trans · finding · 0.780
Dissociation between classification accuracy and causal implication; training on opposites does not always help causally
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.775
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans