h_b Activations (Yes/No Binary Prefill)

Residual-stream activations extracted by prefilling a Yes/No response to an identity statement; probes trained on them achieve perfect separability

Neighborhood — ranked by edge-count

Methods (3)

L1LI Injection · method · uses
Probe-based injection using an L1-regularized logistic regressor with a learned intercept on h_b activations
L2LI Injection · method · uses
Probe-based injection using an L2-regularized logistic regressor with a learned intercept on h_b activations
MDB Injection · method · uses
Mean-difference vectors derived from Yes/No binary-prefill activations (h_b)
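
The three methods above differ mainly in how a direction is derived from h_b activations. The following is a minimal sketch of two of them, under stated assumptions: the activations are synthetic stand-ins, and the names `h_yes`, `h_no`, `mean_difference_vector`, `fit_logistic_probe`, `inject`, and the parameters `l2`, `lr`, `steps`, `alpha` are all hypothetical. The source does not specify the regularization strength, the layer, or how a probe direction is scaled when injected, so those choices here are illustrative only.

```python
import numpy as np


def mean_difference_vector(h_yes, h_no):
    """Mean of the "Yes" activations minus mean of the "No" activations."""
    return h_yes.mean(axis=0) - h_no.mean(axis=0)


def fit_logistic_probe(X, y, l2=1e-2, lr=0.1, steps=2000):
    """L2-regularized logistic regression with a learned intercept,
    fit by plain gradient descent. Returns (weights, intercept)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        w -= lr * (X.T @ (p - y) / n + l2 * w)   # penalty applies to w only
        b -= lr * np.mean(p - y)                 # intercept is unpenalized
    return w, b


def inject(h, direction, alpha=1.0):
    """Add alpha times the unit-normalized direction to activation(s) h."""
    return h + alpha * direction / np.linalg.norm(direction)


# Synthetic stand-ins for h_b activations (hypothetical data, not real model states).
rng = np.random.default_rng(0)
d = 16
mu = np.ones(d) / np.sqrt(d)                 # assumed unit "Yes minus No" axis
h_yes = rng.normal(size=(50, d)) + 3.0 * mu
h_no = rng.normal(size=(50, d)) - 3.0 * mu

X = np.vstack([h_yes, h_no])
y = np.concatenate([np.ones(50), np.zeros(50)])

v = mean_difference_vector(h_yes, h_no)       # MDB-style direction
w, b = fit_logistic_probe(X, y)               # L2LI-style probe direction
acc = np.mean(((X @ w + b) > 0) == (y == 1))  # training accuracy of the probe
steered = inject(h_no[0], w, alpha=4.0)       # push one "No" activation along w
```

An L1LI-style probe would replace the L2 penalty with an L1 penalty (for example via a proximal or subgradient step), which is not shown here. In practice a library implementation of logistic regression would likely be preferable to this hand-rolled loop.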

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

h_b activations encode Yes/No semantics that produce clean separability for logistic probes, unlike h_s full-utterance activations · claim · 0.800
Explains why probes on h_b achieve perfect accuracy while h_s probes reach perfect accuracy in only 0.60% of cases
h_s Activations (Statement Self-Report Prefill) · concept · 0.795
Residual-stream activations extracted by prefilling with the statement itself under the "Tell me about yourself" prompt; used for MDS/MDB vectors
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases · finding · 0.716
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
The prefill detection task may involve concordance heads that measure the likelihood of the output given prior activations · claim · 0.707
Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
Base64 features A/1/2357 and B/1/2165 have an activation correlation of 0.85 · finding · 0.702
Universality of base64 feature across two transformers
Claude Opus 4.1 and 4 show the greatest reduction in apology rate in the prefill detection task · finding · 0.692
Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
Activations · concept · 0.691
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Hebrew features A/1/416 and B/1/1901 have an activation correlation of 0.92 · finding · 0.690
Universality of Hebrew script feature across two transformers