claim
active
h_b activations encode Yes/No semantics that produce clean separability for logistic probes, unlike h_s full-utterance activations
Explanation for why probes on h_b achieve perfect accuracy but h_s probes succeed only 0.60% of the time
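To make "clean separability for logistic probes" concrete, here is a minimal sketch on synthetic data. It does not use the paper's activations or its probe configuration: `synthetic_activations`, the `gap` parameter, the dimensions, and the training settings are all hypothetical stand-ins. It contrasts a probe fit on two well-separated classes with one fit on classes that share the same distribution.

```python
import numpy as np


def fit_logistic_probe(X, y, lr=0.1, steps=2000):
    """Fit a linear probe (w, b) by gradient descent on the logistic loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        grad = p - y                            # d(loss)/d(logit) per sample
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of samples whose predicted label matches y."""
    return float(((X @ w + b > 0).astype(int) == y).mean())


def synthetic_activations(rng, n, dim, gap):
    """Hypothetical stand-in for activations: two unit-variance Gaussian
    classes whose means differ by `gap` along the first coordinate."""
    y = rng.integers(0, 2, size=n)
    direction = np.zeros(dim)
    direction[0] = 1.0
    X = rng.normal(size=(n, dim)) + np.outer(2 * y - 1, direction) * gap / 2
    return X, y


rng = np.random.default_rng(0)

# Large gap: a single direction carries the label (the "clean" case).
X_sep, y_sep = synthetic_activations(rng, 400, 16, gap=12.0)
# Zero gap: the label is independent of the activations.
X_mix, y_mix = synthetic_activations(rng, 400, 16, gap=0.0)

acc_sep = probe_accuracy(X_sep, y_sep, *fit_logistic_probe(X_sep, y_sep))
acc_mix = probe_accuracy(X_mix, y_mix, *fit_logistic_probe(X_mix, y_mix))
print(f"separable: {acc_sep:.2f}  non-separable: {acc_mix:.2f}")
```

The accuracies here are measured on the training data, so they show only whether a separating hyperplane exists, which is the property the claim is about. They say nothing about generalization.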
Source paper
extracted_from (2026) · Leonardo Blas · Robin Jia · Emilio Ferrara
Neighborhood — ranked by edge-count
Findings (1)
finding
- Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Residual-stream activations extracted by prefilling with Yes/No response to identity statement; achieves perfect probe separability
- The paper positions NLAs as combining unsupervised learning with direct readability.
- Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
- Suggestive evidence for language-independent truth representation in LLMs
- Supported by the finding that non-trivial rotations are required to find aligned representations.
- Mechanistic finding from CausalGym case study showing multi-step information movement in NPI mechanism
- Motivates shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
- Core research question motivating NLA development and validation through case studies and causal interventions.