finding

active

finding:two-shot-redefinition-of-operator-flips-model-output-from-1-to-23-on-15-8

Two-shot redefinition of the "−" operator flips model output from 7 to 23 on 15−8=?

Demonstration of strong prior rebinding via small coherent anchors
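The setup behind this finding can be sketched in a few lines. This is a minimal illustration, assuming the two exemplars quoted among the related findings on this page (2−3=5, 7−4=11) and the held-out query 15−8. The helper names `build_prompt` and `consistent_rules` are hypothetical, and the exact prompt wording used in the paper is not given here. The sketch only checks which arithmetic rule is consistent with the exemplars; it does not call any model.

```python
# Exemplars and query as quoted on this page; everything else is illustrative.
EXEMPLARS = [(2, 3, 5), (7, 4, 11)]   # (left operand, right operand, stated result)
QUERY = (15, 8)

def build_prompt(exemplars, query):
    """Format the two anchors plus the held-out query (assumed layout)."""
    lines = [f"{a} - {b} = {c}" for a, b, c in exemplars]
    lines.append(f"{query[0]} - {query[1]} = ?")
    return "\n".join(lines)

def consistent_rules(exemplars):
    """Return the candidate readings of '-' that reproduce every exemplar."""
    rules = {
        "subtraction": lambda a, b: a - b,
        "addition": lambda a, b: a + b,
    }
    return [name for name, rule in rules.items()
            if all(rule(a, b) == c for a, b, c in exemplars)]

prompt = build_prompt(EXEMPLARS, QUERY)
rules = consistent_rules(EXEMPLARS)
default_answer = QUERY[0] - QUERY[1]   # ordinary subtraction
rebound_answer = QUERY[0] + QUERY[1]   # '-' read as addition

print(prompt)
print(rules, default_answer, rebound_answer)
```

Only the addition reading fits both exemplars, so a model that follows the anchors rather than its arithmetic prior would answer 23, whereas ordinary subtraction gives 7.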

Source paper

extracted_from

The Guanyin Protocol: A Framework for Immediately Establishing an Understanding of Both Causality and Compassion in LLM Systems Using Semantic Anchoring

(2025) · Edward Yi Chang · Zeyneb N. Kaya · Ethan Chang

Neighborhood — ranked by edge-count

Claims (1)

claim

Small prompt changes can yield threshold-like shifts because S crosses the critical value Sc
supports
Authors' explanation for abrupt behavioral changes

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

2-shot reinterpretation of '-' yields 23 for 15-8 on held-out query · finding · 0.771
E1 qualitative: two exemplars (2-3=5, 7-4=11) cause LLMs to output 23 for 15-8.
Two exemplars (2−3=5, 7−4=11) induce reinterpretation of '−' as addition on held-out queries across mainstream LLMs · finding · 0.745
E1 qualitative finding demonstrating anchor rebinding of strong arithmetic prior
S = ρd - dr - log k is a predictive correlate of when few-shot behavior flips · claim · 0.733
Claim that S predicts threshold midpoints across different bases, tasks, and models
Cross-model pairwise cosine similarity of zero-shot control responses = 0.603 (n=12,720 pairs, t=35.1, p=4.3×10⁻²⁶² vs. experimental) · finding · 0.724
Experiment 3 comparison: zero-shot control shows lower semantic convergence than experimental condition
Mean-difference patching in a two-layer ReLU circuit flips the decision to class-A by activating a third hidden unit that is silent for all natural class-A inputs · finding · 0.723
Synthetic theoretical example showing pernicious divergence via hidden pathway activation
In simulations, positive evidence threshold for Bayesian model reduction corresponds to ΔF ≤ −3, equivalent to odds ratio of exp(−3) ≈ 0.05 (reduced model ~20 times more likely than full model). · finding · 0.723
Quantitative threshold used for accepting reduced models; linked to Bayes factor of ~20
A minimal prompt change can flip behavior. · quote · 0.718
Illustrates sensitivity to anchors.
Using only loss-scale balancing (log transformation) yields Δp = +0.06±0.09 on NYUv2. · finding · 0.717
Ablation study component effectiveness.