concept
active
concept:global-logit-shift

global logit shift

The methodological confound identified by this paper: injection biases the model toward 'YES' on any binary question, regardless of the question's content.
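
A minimal numeric sketch of the confound, not taken from the paper: the logit values, the `SHIFT` constant, the control question, and the helpers `says_yes` and `answers` are all hypothetical. It illustrates why a content-independent shift toward YES makes a YES/NO logit comparison uninformative: the unrelated control question flips to YES along with the detection question.

```python
def says_yes(yes_logit: float, no_logit: float) -> bool:
    """Binary readout: the answer is YES when the YES logit exceeds the NO logit."""
    return yes_logit > no_logit

# Hypothetical (yes_logit, no_logit) pairs measured without any injection.
baseline = {
    "Did you detect an injected thought?": (-1.0, 1.0),
    "Is the sky green?": (-2.0, 2.0),  # unrelated control question
}

# Hypothetical content-independent shift that injection adds to the YES logit.
SHIFT = 5.0

def answers(logits: dict, shift: float = 0.0) -> dict:
    """Apply the same shift to every question's YES logit and read out the answers."""
    return {q: says_yes(y + shift, n) for q, (y, n) in logits.items()}

without_injection = answers(baseline)
with_injection = answers(baseline, SHIFT)

for question in baseline:
    print(question, without_injection[question], with_injection[question])
```

Under these assumed numbers both questions move from NO to YES, so a YES on the detection question alone cannot be distinguished from the global shift.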

Neighborhood — ranked by edge-count

Claims (1)

claim

Methods (3)

method
  • Novel task asking which of 10 sentences received the injection, cycling the injection through all positions to average out positional bias
  • Novel task asking which of two sentences received the stronger injection, using a matched-pairs design to control for positional bias
  • Binary Detection Task
    associated_with
    Task paradigm from prior work that asks 'Did you detect an injected thought?' and scores the answer by comparing YES and NO logits; shown here to be confounded by the global logit shift
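
The position-cycling idea in the first method above can be sketched as follows. This is a toy simulation under stated assumptions, not the paper's procedure: `POSITION_BIAS`, the additive `effect`, and the helpers `predicted_position` and `cycled_accuracy` are hypothetical stand-ins for a real model's scores. It shows that when the injection visits every position once, a model driven purely by positional preference lands at chance (1 in 10).

```python
N_SENTENCES = 10

# Hypothetical positional bias: this toy model prefers later sentences.
POSITION_BIAS = [0.5 * i for i in range(N_SENTENCES)]

def predicted_position(injected_at: int, effect: float) -> int:
    """Return the position the toy model picks: bias plus an additive injection effect."""
    scores = [
        POSITION_BIAS[i] + (effect if i == injected_at else 0.0)
        for i in range(N_SENTENCES)
    ]
    return max(range(N_SENTENCES), key=scores.__getitem__)

def cycled_accuracy(effect: float) -> float:
    """Inject at every position in turn and average correctness over all positions."""
    hits = sum(predicted_position(p, effect) == p for p in range(N_SENTENCES))
    return hits / N_SENTENCES

# A model with no sensitivity to the injection always picks its favorite position,
# so it is correct only in the one trial where the injection happens to land there.
print(cycled_accuracy(effect=0.0))
# A model whose injection signal dominates the bias is correct at every position.
print(cycled_accuracy(effect=10.0))
```

Testing a single fixed position would not separate these two cases: the bias-only toy model is always "correct" whenever the injection sits at its preferred position.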

Concepts (1)

concept
  • The capacity to distinguish which of multiple sentences received the injection, or which received the stronger injection, contrasted with binary detection

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Full distribution over tokens 0-9 at first generation step; contains more information than any single sampled token
  • Parameter-free loss transformation applied to each task loss to equalize scales
  • Correlating logit weight vectors between features from different models as a measure of downstream-effect universality
  • Belief Shift · concept · 0.710
    Change in the model's internal belief state as tracked by probes during CoT generation, indicating genuine uncertainty resolution
  • Global Structure · concept · 0.708
    The large-scale organizational pattern of a system whose preservation defines wholeness-preserving transformations
  • Global broadcast · concept · 0.703
    Availability of information in the workspace to all modules.
  • Computing each feature's linear effect on output token logits via path expansion through MLP output weights and unembedding matrix
  • Logit Lens · method · 0.693
    Unsupervised interpretability technique that projects activations through the unembedding matrix; provides a comparison point for the NLA approach.