Sauers' reconstruction experiment

Statistical method: ask a model to recall random numbers from its earlier outputs, with and without an explanation of the transformer architecture in context, then compare the resulting distributions of reconstruction accuracy.
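The comparison described above can be sketched roughly as follows. This is a minimal illustration, not Sauers' actual code: the function names, the per-position accuracy definition, the tail thresholds, and the toy numbers are all assumptions made for the example, and no real model is queried.

```python
from statistics import mean, pstdev

def reconstruction_accuracy(original, recalled):
    """Fraction of positions where the recalled number equals the original.

    Assumed metric for illustration; the source does not define accuracy.
    """
    if len(original) != len(recalled):
        raise ValueError("sequences must have equal length")
    if not original:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for o, r in zip(original, recalled) if o == r)
    return matches / len(original)

def tail_fractions(accuracies, low, high):
    """Share of trials at or below `low` and at or above `high`."""
    n = len(accuracies)
    lower = sum(1 for a in accuracies if a <= low) / n
    upper = sum(1 for a in accuracies if a >= high) / n
    return lower, upper

def summarize(accuracies, low=0.25, high=0.75):
    """Mean, spread, and tail mass of one condition's accuracy distribution."""
    lower, upper = tail_fractions(accuracies, low, high)
    return {
        "mean": mean(accuracies),
        "std": pstdev(accuracies),
        "lower_tail": lower,
        "upper_tail": upper,
    }

# Toy placeholder data (hypothetical, not experimental results).
original = [17, 42, 8, 93]
baseline_trials = [[17, 40, 8, 90], [17, 42, 1, 93]]   # no explanation shown
explained_trials = [[17, 42, 8, 93], [3, 5, 1, 0]]     # explanation shown

baseline = [reconstruction_accuracy(original, t) for t in baseline_trials]
explained = [reconstruction_accuracy(original, t) for t in explained_trials]

print("baseline ", summarize(baseline))
print("explained", summarize(explained))
```

Reporting tail mass separately from the mean matters here because an effect that widens the distribution in both directions can leave the mean nearly unchanged.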

Neighborhood — ranked by edge-count

Artifacts (1)

artifact

Sauers' introspection in Claude post
implements
Twitter thread detailing the reconstruction experiment, its statistical analysis, and the effect of showing the model the Janus post.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sauers' statistical anomaly: when models are given Janus post explaining transformers, reconstruction accuracy tails extend both ways, with ~1/1000 reconstructions anomalously accurate · finding · 0.779
Statistically rigorous analysis of Claude introspection; suggests models may have latent introspective capabilities that can be enhanced or disrupted.
reconstruction accuracy · concept · 0.725
Metric of how well models reconstruct information from hidden states; Sauers' study found that showing models the Janus thread extends the tails of the distribution.
Three-dimensional reconstruction techniques · method · 0.711
Methods for visualizing fungal networks in ants.
Experiment 2: SAE Deception Feature Steering · concept · 0.698
Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
Sparsity-reconstruction tradeoff · concept · 0.689
The balance between how sparse and how faithful a decomposition is; VPD achieves a better tradeoff than transcoders.
giving models janus's thread extends reconstruction accuracy distribution tails in both directions · finding · 0.683
Sauers' study: exposing models to the Janus post extended both the positive and negative extremes of reconstruction accuracy.
Activation Reconstructor (AR) · method · 0.679
Component of NLA that maps natural-language explanations back to activations; truncated to the first l layers of the target model.
What real phenomenon is reflected in these experiments? · question · 0.678
Asks what underlying reality causes the consistent choices.