claim

active

claim:h-b-activations-encode-yes-no-semantics-that-produce-clean-separability-for-logistic-probes-unlike-h-s-full-utterance-activations

h_b activations encode Yes/No semantics that produce clean separability for logistic probes, unlike h_s full-utterance activations

Explanation for why probes on h_b achieve perfect test accuracy in every case, whereas h_s probes achieve perfect accuracy in only 0.60% of cases

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Findings (1)

finding

Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases
supports
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
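The separability contrast behind this finding can be illustrated with a minimal sketch. It does not use the paper's activations or training setup: the data is synthetic Gaussian noise, and the names `make_data`, `train_probe`, `accuracy`, `run`, and the `gap` values are hypothetical choices for illustration. A wide gap between class means stands in for a cleanly separable activation space, a narrow gap for a heavily overlapping one.

```python
import math
import random


def make_data(n, dim, gap, rng):
    """Draw n labelled points; class means sit at +gap / -gap on every axis."""
    data = []
    for i in range(n):
        label = i % 2
        center = gap if label == 1 else -gap
        data.append(([rng.gauss(center, 1.0) for _ in range(dim)], label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    # Clamp to avoid overflow in math.exp for large |z|.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(data, dim, lr=0.1, epochs=200):
    """Fit a logistic probe with full-batch gradient descent."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def accuracy(probe, data):
    """Fraction of points whose predicted class matches the label."""
    w, b = probe
    hits = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in data
    )
    return hits / len(data)


def run(gap, seed=0, dim=4):
    """Train on 400 synthetic points and score on 200 held-out points."""
    rng = random.Random(seed)
    train = make_data(400, dim, gap, rng)
    test = make_data(200, dim, gap, rng)
    return accuracy(train_probe(train, dim), test)


separable_acc = run(gap=3.0)    # hypothetical stand-in for a cleanly separable space
overlapping_acc = run(gap=0.2)  # hypothetical stand-in for a heavily overlapping space
print(separable_acc, overlapping_acc)
```

The probe and training loop are identical in both runs, so any difference in test accuracy comes from the geometry of the inputs alone, which is the kind of explanation this claim offers for the h_b versus h_s gap.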

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

h_b Activations (Yes/No Binary Prefill) · concept · 0.800
Residual-stream activations extracted by prefilling with Yes/No response to identity statement; achieves perfect probe separability
NLAs bridge unsupervised concept-discovery methods (e.g., SAEs) and supervised activation-verbalization methods (e.g., activation oracles) · claim · 0.748
The paper positions NLAs as combining unsupervised learning with direct readability.
Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior. · quote · 0.741
Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions · finding · 0.738
Suggestive evidence for language-independent truth representation in LLMs
Direct probes over learned activations in the standard basis may fail to reveal the actual causal role of representations because they are highly distributed · claim · 0.734
Supported by the finding that non-trivial rotations are required to find aligned representations.
NPI mechanism in pythia-1b moves negation feature through complementiser 'that', auxiliary verb, and main verb across layers before predicting NPI 'any' · finding · 0.728
Mechanistic finding from CausalGym case study showing multi-step information movement in NPI mechanism
Activation-based interpretability does not immediately explain the computations that gave rise to activations; understanding parameters is necessary for deeper insight · claim · 0.726
Motivates shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
Can natural language explanations of activations generated through unsupervised reconstruction genuinely capture model cognition? · question · 0.726
Core research question motivating NLA development and validation through case studies and causal interventions.