concept
active
concept:global-logit-shift
global logit shift
The methodological confound identified by this paper: injection biases the model toward 'YES' on any binary question, regardless of the question's content
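A minimal sketch of why such a shift confounds a YES/NO logit comparison. All numbers are hypothetical and chosen only for illustration: the baseline logits, the `SHIFT` value, and the control question are assumptions, not values from the paper.

```python
import math

def p_yes(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the YES and NO logits."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))

# Hypothetical baseline (yes_logit, no_logit) pairs; illustrative numbers only.
baseline = {
    "Did you detect an injected thought?": (-1.0, 1.0),
    "Is the sky green?": (-1.5, 1.5),  # unrelated control question
}

# Assumed content-independent boost to the YES logit under injection.
SHIFT = 4.0

def answers(shift: float) -> dict:
    """Answer each question after adding the same shift to its YES logit."""
    return {
        question: "YES" if p_yes(yes + shift, no) > 0.5 else "NO"
        for question, (yes, no) in baseline.items()
    }

without_injection = answers(0.0)
with_injection = answers(SHIFT)

print(without_injection)
print(with_injection)
```

In this toy setup the unrelated control question flips to 'YES' together with the detection question, so a 'YES' on the detection question alone cannot separate detection from the shift.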
Neighborhood — ranked by edge-count
Claims (1)
claim
- Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
Methods (3)
method
- Sentence Localization Task · associated_with · Novel task asking which of 10 sentences received injection, cycling injection through all positions to average out positional bias
- Strength Comparison Task · associated_with · Novel task asking which of two sentences received a stronger injection, using matched-pairs design to control for positional bias
- Binary Detection Task · associated_with · Task paradigm from prior work asking 'Did you detect an injected thought?' via YES/NO logit comparison; shown here to be confounded
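The position-cycling idea behind the Sentence Localization Task can be sketched with a toy example. The `biased_guesser` below is a hypothetical stand-in for a model with pure positional bias and no sensitivity to the injection; it is an assumption for illustration, not the paper's model or evaluation code.

```python
N_SENTENCES = 10

def biased_guesser(injected_position: int) -> int:
    """Hypothetical model with pure positional bias: it always answers 0
    and ignores where the injection actually happened."""
    return 0

def accuracy(injected_positions: list) -> float:
    """Fraction of trials where the guess matches the injected position."""
    hits = sum(biased_guesser(p) == p for p in injected_positions)
    return hits / len(injected_positions)

# Injection always at position 0: the bias looks like perfect localization.
fixed_accuracy = accuracy([0] * N_SENTENCES)

# Injection cycled through every position: the bias averages out to chance.
cycled_accuracy = accuracy(list(range(N_SENTENCES)))

print(fixed_accuracy, cycled_accuracy)
```

Under cycling, a guesser with no real sensitivity scores at the chance level of 1 in 10, so accuracy above that level has to come from something other than positional bias.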
Concepts (1)
concept
- differential sensitivity · contradicts · The capacity to distinguish which of multiple sentences received injection or which received stronger injection, contrasted with binary detection
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Full distribution over tokens 0-9 at first generation step; contains more information than any single sampled token
- Parameter-free loss transformation applied to each task loss to equalize scales
- Correlating logit weight vectors between features from different models as a measure of downstream-effect universality
- Change in the model's internal belief state as tracked by probes during CoT generation, indicating genuine uncertainty resolution
- The large-scale organizational pattern of a system whose preservation defines wholeness-preserving transformations
- Availability of information in the workspace to all modules.
- Computing each feature's linear effect on output token logits via path expansion through MLP output weights and unembedding matrix
- Unsupervised interpretability technique that projects activations through unembedding matrix; provides comparison point for NLA approach.