concept
active
concept:h-b-activations-yes-no-binary-prefill
h_b Activations (Yes/No Binary Prefill)
Residual-stream activations extracted by prefilling a Yes/No response to an identity statement; probes trained on these activations achieve perfect separability
Neighborhood — ranked by edge-count
Methods (3)
- L1LI Injection · uses · Probe-based injection using L1-regularized logistic regressor with learned intercept on h_b activations
- L2LI Injection · uses · Probe-based injection using L2-regularized logistic regressor with learned intercept on h_b activations
- MDB Injection · uses · Mean-difference vectors derived from Yes/No binary-prefill activations (h_b)
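The three methods above derive a direction from h_b activations in two ways: a mean difference between Yes and No activations, and the weight vector of a regularized logistic probe with a learned intercept. The following is a minimal sketch of both, not the source's implementation. The activations are synthetic stand-ins for h_b, and every name (`mean_difference`, `fit_probe`, `fake_activation`), the dimensionality, the regularization strength, and the optimizer settings are assumptions made for illustration.

```python
import math
import random


def mean_difference(yes_acts, no_acts):
    """Direction = mean(Yes activations) - mean(No activations)."""
    dim = len(yes_acts[0])
    mean_yes = [sum(a[i] for a in yes_acts) / len(yes_acts) for i in range(dim)]
    mean_no = [sum(a[i] for a in no_acts) / len(no_acts) for i in range(dim)]
    return [y - n for y, n in zip(mean_yes, mean_no)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def fit_probe(acts, labels, penalty, lam=0.01, lr=0.5, steps=300):
    """Logistic probe with a learned, unpenalized intercept (assumed settings)."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        errs = [sigmoid(score(w, b, x)) - y for x, y in zip(acts, labels)]
        grad_w = [sum(e * x[i] for e, x in zip(errs, acts)) / n for i in range(dim)]
        grad_b = sum(errs) / n
        if penalty == "l2":
            # Plain gradient step on loss + (lam / 2) * ||w||^2.
            w = [wi - lr * (g + lam * wi) for wi, g in zip(w, grad_w)]
        elif penalty == "l1":
            # Proximal step: gradient update, then soft-thresholding.
            stepped = [wi - lr * g for wi, g in zip(w, grad_w)]
            w = [math.copysign(max(abs(s) - lr * lam, 0.0), s) for s in stepped]
        else:
            raise ValueError("penalty must be 'l1' or 'l2'")
        b -= lr * grad_b
    return w, b


def accuracy(w, b, acts, labels):
    hits = sum((score(w, b, x) > 0) == (y == 1) for x, y in zip(acts, labels))
    return hits / len(acts)


# Synthetic stand-in for h_b: Yes and No differ along coordinate 0 only.
random.seed(0)
DIM = 8


def fake_activation(sign):
    x = [random.gauss(0.0, 0.1) for _ in range(DIM)]
    x[0] += sign
    return x


yes_acts = [fake_activation(+1.0) for _ in range(20)]
no_acts = [fake_activation(-1.0) for _ in range(20)]
acts = yes_acts + no_acts
labels = [1] * 20 + [0] * 20

mdb_vector = mean_difference(yes_acts, no_acts)   # MDB-style direction
w_l1, b_l1 = fit_probe(acts, labels, "l1")        # L1LI-style probe
w_l2, b_l2 = fit_probe(acts, labels, "l2")        # L2LI-style probe

print("L1 probe accuracy:", accuracy(w_l1, b_l1, acts, labels))
print("L2 probe accuracy:", accuracy(w_l2, b_l2, acts, labels))
```

In this toy setup the two classes are linearly separable by construction, which mirrors the separability reported for h_b only in spirit. How the resulting vectors are scaled and injected into the residual stream is not specified on this page and is left out of the sketch.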
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. These are candidates for new edges or unrecognized duplicates.
- Explanation for why probes on h_b achieve perfect accuracy but h_s probes succeed only 0.60% of the time
- Residual-stream activations extracted by prefilling with the statement itself under the "Tell me about yourself" prompt; used for MDS/MDB vectors
- Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
- Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
- Universality of base64 feature across two transformers
- Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.692 · Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
- Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
- Universality of Hebrew script feature across two transformers
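The selection rule for this list (cosine similarity at or above 0.65 and no typed edge to the current entity) can be sketched as follows. This is an illustrative guess at the filter, not the graph tool's actual code. The function names, the toy 3-dimensional embeddings, and the entity keys are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it."""
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}
    found = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            found.append((name, sim))
    # Highest similarity first.
    return sorted(found, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical values, not taken from the graph).
embeddings = {
    "h_b": [1.0, 0.0, 0.0],
    "similar_untyped": [0.8, 0.6, 0.0],   # cosine about 0.8, no typed edge
    "dissimilar": [0.6, 0.8, 0.0],        # cosine about 0.6, below threshold
    "typed_method": [0.9, 0.1, 0.0],      # similar, but already linked
}
typed_edges = [("typed_method", "h_b")]   # e.g. a "uses" relation

neighbors = untyped_neighbors("h_b", embeddings, typed_edges)
print(neighbors)
```

Entities that already carry a typed relation, such as the three methods listed earlier, are excluded even when their similarity is high, which is why this list surfaces only candidates for new edges or duplicates.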