finding

active

finding:sauers-statistical-anomaly-when-models-are-given-janus-post-explaining-transformers-reconstruction-accuracy-tails-extend-both-ways-with-1-1000-reconstructions-anomalously-accurate

Sauers' statistical anomaly: when models are given a Janus post explaining transformers, the tails of the reconstruction-accuracy distribution extend in both directions, with ~1/1000 reconstructions anomalously accurate

Statistically rigorous analysis of Claude introspection; suggests models may have latent introspective capabilities that can be enhanced or disrupted.

Source paper

extracted_from

Anima Labs Phenomenology Pt1

Neighborhood — ranked by edge-count

Claims (2)

claim

The objection that feedforward networks cannot introspect is a cultural myth; autoregression provides recurrence across tokens.
supports
Antra's rebuttal to a common criticism; backed by Janus' information flow diagram.
Anthropic is extremely conservative in writing up interpretability results due to Overton window concerns.
supports
Antra's explanation for why even stronger evidence may exist but remains unpublished.

Artifacts (1)

artifact

Sauers' introspection in Claude post
about
Twitter thread detailing reconstruction experiment, statistical analysis, and the effect of showing Janus post.
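
A minimal sketch of how a "tails extend both ways" comparison could be checked, assuming per-trial reconstruction accuracy scores are available for a baseline condition and a condition where the explanatory post was shown. The function names, the 5% cutoff, and the toy numbers are illustrative assumptions, not taken from the thread.

```python
def empirical_quantile(sorted_values, p):
    """Return a simple empirical quantile (index-based, no interpolation)."""
    index = int(p * (len(sorted_values) - 1))
    return sorted_values[index]


def tail_fractions(baseline, treated, p=0.05):
    """Fraction of treated scores below / above the baseline's p and 1-p quantiles.

    If both fractions exceed p, the treated distribution has heavier tails
    in both directions than the baseline.
    """
    ordered = sorted(baseline)
    low_cut = empirical_quantile(ordered, p)
    high_cut = empirical_quantile(ordered, 1 - p)
    n = len(treated)
    below = sum(1 for score in treated if score < low_cut) / n
    above = sum(1 for score in treated if score > high_cut) / n
    return below, above


# Toy, hand-made scores purely for illustration (not experimental data).
baseline_scores = list(range(1, 101))
treated_scores = [0, 0, 50, 50, 50, 50, 50, 50, 200, 200]
below, above = tail_fractions(baseline_scores, treated_scores, p=0.05)
print(f"below baseline 5% cutoff: {below:.2f}, above 95% cutoff: {above:.2f}")
```

Comparing against baseline quantiles rather than against means is a design choice here: a change that widens both tails can leave the average accuracy nearly unchanged.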

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Giving models Janus's thread extends reconstruction accuracy distribution tails in both directions · finding · 0.852
Sauers' study: exposing models to Janus's post extended both the positive and negative extremes of reconstruction accuracy.
Sauers' reconstruction experiment · method · 0.779
Statistical method: ask the model to recall random numbers from its earlier outputs, with and without an explanation of the transformer architecture, and measure the distribution of reconstruction accuracy.
3 of 64 simulated agents exhibited superstitious (incorrect) abduction, leading to persistently poor performance, demonstrating a trade-off between ampliative benefit and susceptibility to false insight. · finding · 0.759
Demonstration of a failure mode of abductive model reduction.
Can natural language explanations of activations generated through unsupervised reconstruction genuinely capture model cognition? · question · 0.758
Core research question motivating NLA development and validation through case studies and causal interventions.
Cosine similarity between perturbed and baseline residual streams returns toward 1.0, and the projection onto the injection direction decays exponentially over subsequent layers · finding · 0.753
Mechanistic evidence that the network actively attenuates injected perturbations, explaining late-layer introspection failure.
Transformers are recurrent through autoregression because K/V stream provides horizontal information flow across positions. · claim · 0.748
Claim formalizing the Anima Labs idea that transformers are effectively recurrent due to K/V stream.
CKA and RSA show potentially unintuitive (over-estimated) hidden state similarity for GRU-Transformer comparisons on Multi-Object task · finding · 0.744
Prior work shows transformers use anti-Markovian solutions; MAS correctly shows low IIA reflecting this, while RSA/CKA do not detect it.
In simulations, positive evidence threshold for Bayesian model reduction corresponds to ΔF ≤ −3, equivalent to odds ratio of exp(−3) ≈ 0.05 (reduced model ~20 times more likely than full model). · finding · 0.741
Quantitative threshold used for accepting reduced models; linked to a Bayes factor of ~20.
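
The arithmetic behind the ΔF ≤ −3 threshold in the last entry can be checked with a short sketch. It assumes the sign convention the entry implies, namely that exp(ΔF) gives the odds of the full model relative to the reduced one; the function name is hypothetical.

```python
import math


def odds_ratio_from_delta_f(delta_f):
    """Convert a free-energy difference (in nats) into an odds ratio.

    Assumed convention: exp(delta_f) is the odds of the full model
    relative to the reduced model, so negative values favor the reduced model.
    """
    return math.exp(delta_f)


threshold = -3.0
odds = odds_ratio_from_delta_f(threshold)   # roughly 0.05
bayes_factor = 1.0 / odds                   # roughly 20 in favor of the reduced model
print(f"odds ratio: {odds:.4f}, Bayes factor for reduced model: {bayes_factor:.1f}")
```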