finding

active

finding:giving-models-janus-s-thread-extends-reconstruction-accuracy-distribution-tails-in-both-directions

giving models janus's thread extends reconstruction accuracy distribution tails in both directions

Sauers' study: exposing models to Janus's post extended both the positive and negative extremes of the reconstruction accuracy distribution.

Source paper

extracted_from

Janus Information Flow Transformers 2025

Neighborhood — ranked by edge-count

Claims (1)

claim

Information from point A to B can travel through C(m+n, n) distinct paths, which quickly exceeds the number of atoms in the visible universe.
supports
Janus's mathematical claim about exponential path combinatorics in transformers.
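
To give a sense of how fast C(m+n, n) grows, the count can be evaluated directly. The sketch below is illustrative only: it assumes m and n count steps along two axes of a grid (the source does not define them here), it takes 10^80 as a rough order-of-magnitude figure for the number of atoms in the visible universe, and the helper names `path_count` and `smallest_square_grid_exceeding` are hypothetical.

```python
import math

# Assumed order-of-magnitude estimate for atoms in the visible universe;
# this figure is not taken from the source.
ATOMS_IN_VISIBLE_UNIVERSE = 10**80


def path_count(m: int, n: int) -> int:
    """Number of monotone lattice paths across an m-by-n grid: C(m+n, n)."""
    return math.comb(m + n, n)


def smallest_square_grid_exceeding(threshold: int) -> int:
    """Smallest n for which C(2n, n) is strictly greater than threshold."""
    n = 1
    while path_count(n, n) <= threshold:
        n += 1
    return n


if __name__ == "__main__":
    n = smallest_square_grid_exceeding(ATOMS_IN_VISIBLE_UNIVERSE)
    print(f"C(2n, n) first exceeds 10^80 at n = {n}")
    print(f"C({2 * n}, {n}) has {len(str(path_count(n, n)))} digits")
```

Python integers are arbitrary precision, so `math.comb` returns the exact count rather than a floating-point approximation.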

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sauers' statistical anomaly: when models are given Janus post explaining transformers, reconstruction accuracy tails extend both ways, with ~1/1000 reconstructions anomalously accurate · finding · 0.852
Statistically rigorous analysis of Claude introspection; suggests models may have latent introspective capabilities that can be enhanced or disrupted.
Cosine similarity between perturbed and baseline residual streams returns toward 1.0 and projection onto injection direction decays exponentially over subsequent layers · finding · 0.748
Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.744
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
Model stitching can use the behavioral null space of the source model when mapping to the target, making successful stitching insufficient evidence of representational similarity · claim · 0.741
Formal analysis showing the theoretical limitation of model stitching as a similarity measure.
The model converges to a more stable truth direction in middle-to-late layers, as evidenced by increasing cosine similarity between layer-wise probes. · claim · 0.734
Supported by the geometric transition visible in cosine similarity heatmaps for F0-F3.
Steering vector extracted from final post-expert-iteration model also successfully elicits deployment behavior · finding · 0.733
Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
User message embeddings predict subsequent model Assistant Axis projection with R2=0.53-0.77 (p<0.001) but predict delta from previous response with only R2=0.10 · finding · 0.731
Shows model persona position is primarily determined by the most recent user message, not prior drift
probably helps not only with faithful reconstruction but also creates interference patterns that encode nuanced information about the deltas and convergences between states. · quote · 0.730
Key quote connecting path redundancy to interferometric information encoding.