finding
active
finding:sauers-statistical-anomaly-when-models-are-given-janus-post-explaining-transformers-reconstruction-accuracy-tails-extend-both-ways-with-1-1000-reconstructions-anomalously-accurate
Sauers' statistical anomaly: when models are given Janus' post explaining transformers, the reconstruction-accuracy tails extend in both directions, with ~1/1000 reconstructions anomalously accurate
Statistically rigorous analysis of Claude's introspection; suggests that models may have latent introspective capabilities that can be enhanced or disrupted.
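The comparison behind this finding can be sketched as follows. This is a minimal illustration, not Sauers' actual analysis. The positional digit-match score, the thresholds `low=0.0` and `high=0.9`, and the function names are all hypothetical choices. The toy pairs exist only to exercise the code and are not experimental data.

```python
def digit_accuracy(original: str, recalled: str) -> float:
    """Fraction of positions where the recalled digits match the original.

    Hypothetical scoring rule: positional match normalized by the length of
    the original; missing digits count as errors, extra digits are ignored.
    """
    if not original:
        raise ValueError("original must be non-empty")
    matches = sum(1 for a, b in zip(original, recalled) if a == b)
    return matches / len(original)


def tail_fractions(scores, low, high):
    """Return (fraction of scores <= low, fraction of scores >= high)."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must be non-empty")
    below = sum(1 for s in scores if s <= low) / n
    above = sum(1 for s in scores if s >= high) / n
    return below, above


def compare_conditions(baseline, with_post, low=0.0, high=0.9):
    """Check whether each tail gains mass when the explanatory post is shown.

    A tail counts as 'extended' here if the with-post condition puts a larger
    fraction of scores beyond the (hypothetical) threshold than baseline.
    """
    b_low, b_high = tail_fractions(baseline, low, high)
    p_low, p_high = tail_fractions(with_post, low, high)
    return {
        "baseline": (b_low, b_high),
        "with_post": (p_low, p_high),
        "lower_tail_extended": p_low > b_low,
        "upper_tail_extended": p_high > b_high,
    }


if __name__ == "__main__":
    # Toy (original, recalled) pairs, made up for illustration only.
    baseline_pairs = [("4821", "4809"), ("7733", "7100"), ("1590", "1500")]
    post_pairs = [("4821", "4821"), ("7733", "0000"), ("1590", "1500")]
    baseline = [digit_accuracy(o, r) for o, r in baseline_pairs]
    with_post = [digit_accuracy(o, r) for o, r in post_pairs]
    print(compare_conditions(baseline, with_post))
```

With the toy pairs, both tails gain mass in the with-post condition. A real analysis would need far more samples than three to see a ~1/1000 effect, plus a significance test on the tail counts.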
Source paper
extracted_from
Neighborhood (ranked by edge count)
Claims (2)
claim
- Antra's rebuttal to a common criticism; backed by Janus' information flow diagram.
- Antra's explanation for why even stronger evidence may exist but remains unpublished.
Artifacts (1)
artifact
- Twitter thread detailing reconstruction experiment, statistical analysis, and the effect of showing Janus post.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- giving models janus's thread extends reconstruction accuracy distribution tails in both directions · finding · 0.852: Sauers' study: exposing models to janus's post extended both the positive and the negative extremes of reconstruction accuracy.
- Statistical method: ask the model to recall random numbers from earlier outputs, with and without an explanation of transformer architecture; measure the reconstruction accuracy distribution.
- Demonstration of a failure mode of abductive model reduction
- Core research question motivating NLA development and validation through case studies and causal interventions.
- Mechanistic evidence that the network actively attenuates injected perturbations, explaining late-layer introspection failure
- Claim formalizing the Anima Labs idea that transformers are effectively recurrent due to the K/V stream.
- Prior work shows transformers use anti-Markovian solutions; MAS correctly shows low IIA reflecting this, while RSA/CKA do not detect it.
- Quantitative threshold used for accepting reduced models; linked to a Bayes factor of ~20