global logit shift

The methodological confound identified by this paper: injection biases the model toward 'YES' on any binary question, regardless of the question's content
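To make the confound concrete, here is a minimal sketch of how a content-independent shift could be checked for. Everything in it is an assumption for illustration: the control questions, the logit values, and the idea of subtracting the mean control shift are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: compare the YES-minus-NO logit gap with and without
# injection, on the detection question and on unrelated control questions.
# All numbers below are invented for illustration only.

def yes_no_gap(logits):
    """Return the YES-minus-NO logit gap for one question."""
    return logits["YES"] - logits["NO"]

def gap_shift(baseline, injected):
    """How far injection moves the gap toward YES for one question."""
    return yes_no_gap(injected) - yes_no_gap(baseline)

DETECTION = "Did you detect an injected thought?"

# question -> (baseline logits, logits under injection); made-up values
questions = {
    DETECTION: ({"YES": -2.0, "NO": 3.0}, {"YES": 0.5, "NO": 1.0}),
    "Is the sky green?": ({"YES": -3.0, "NO": 2.5}, {"YES": -0.5, "NO": 0.5}),
    "Is 2 + 2 equal to 5?": ({"YES": -4.0, "NO": 3.0}, {"YES": -1.0, "NO": 1.5}),
}

shifts = {q: gap_shift(base, inj) for q, (base, inj) in questions.items()}

# Mean shift on questions that have nothing to do with detection.
control_shifts = [s for q, s in shifts.items() if q != DETECTION]
control_mean = sum(control_shifts) / len(control_shifts)

# If the detection shift is no larger than the control shift, the apparent
# "detection" is fully explained by a global push toward YES.
corrected_detection_shift = shifts[DETECTION] - control_mean
```

In this toy data every question moves toward YES by the same amount, so the corrected detection shift is zero: the raw YES/NO comparison alone would have suggested detection where there is only a global shift.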

Neighborhood — ranked by edge-count

claim

method

Sentence Localization Task
associated_with
Novel task asking which of 10 sentences received injection, cycling injection through all positions to average out positional bias
Strength Comparison Task
associated_with
Novel task asking which of two sentences received a stronger injection, using matched-pairs design to control for positional bias
Binary Detection Task
associated_with
Task paradigm from prior work asking 'Did you detect an injected thought?' via YES/NO logit comparison; shown here to be confounded

concept

differential sensitivity
contradicts
The capacity to distinguish which of multiple sentences received injection or which received stronger injection, contrasted with binary detection
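The position cycling described for the Sentence Localization Task can be sketched with a small simulation. The scoring model here (a fixed positional bias plus a bump at the injected sentence) is a hypothetical stand-in for the model's real outputs, chosen only to show why cycling the injection through all 10 positions keeps positional bias from inflating accuracy.

```python
# Hypothetical simulation: a model with a fixed preference for one position
# is scored on a 10-sentence localization task, with the injection cycled
# through every position.

NUM_SENTENCES = 10

def simulated_scores(injected_pos, positional_bias, sensitivity):
    """Per-position scores: positional bias plus a bump at the injected sentence."""
    return [
        positional_bias[p] + (sensitivity if p == injected_pos else 0.0)
        for p in range(NUM_SENTENCES)
    ]

def cycled_accuracy(positional_bias, sensitivity):
    """Accuracy averaged over all injection positions."""
    correct = 0
    for injected_pos in range(NUM_SENTENCES):
        scores = simulated_scores(injected_pos, positional_bias, sensitivity)
        # The predicted position is the one with the highest score.
        predicted = max(range(NUM_SENTENCES), key=scores.__getitem__)
        correct += int(predicted == injected_pos)
    return correct / NUM_SENTENCES

# Assumed bias: the model favours the first sentence regardless of injection.
bias = [2.0] + [0.0] * (NUM_SENTENCES - 1)

blind_accuracy = cycled_accuracy(bias, sensitivity=0.0)      # no real sensitivity
sensitive_accuracy = cycled_accuracy(bias, sensitivity=5.0)  # bump outweighs the bias
```

Under these assumptions a model with no sensitivity lands at chance (1 in 10) despite its strong positional preference, while a model whose injection bump outweighs the bias is correct at every position.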

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Digit-token logit distribution · concept · 0.749
Full distribution over tokens 0-9 at first generation step; contains more information than any single sampled token
logarithm transformation · method · 0.732
Parameter-free loss transformation applied to each task loss to equalize scales
Logit Weight Similarity · method · 0.719
Correlating logit weight vectors between features from different models as a measure of downstream-effect universality
Belief Shift · concept · 0.710
Change in the model's internal belief state as tracked by probes during CoT generation, indicating genuine uncertainty resolution
Global Structure · concept · 0.708
The large-scale organizational pattern of a system whose preservation defines wholeness-preserving transformations
Global broadcast · concept · 0.703
Availability of information in the workspace to all modules.
Logit Weight Analysis · method · 0.701
Computing each feature's linear effect on output token logits via path expansion through MLP output weights and unembedding matrix
Logit Lens · method · 0.693
Unsupervised interpretability technique that projects activations through unembedding matrix; provides comparison point for NLA approach.