method
active
method:sauers-reconstruction-experiment
Sauers' reconstruction experiment
Statistical method: ask the model to recall random numbers from its earlier outputs, both with and without an explanation of the transformer architecture provided, and measure the distribution of reconstruction accuracy.
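A minimal sketch of how the measurement could be scored, under stated assumptions. The source does not specify the scoring rule, so exact per-position match is an assumption here, and `simulated_recall` is a hypothetical stand-in for a real model call. The `p_correct` values are placeholders, not results from the study.

```python
import random
import statistics


def reconstruction_accuracy(original, recalled):
    """Fraction of positions where the recalled number matches the original.

    Assumption: exact per-position match; the source does not name a metric.
    """
    if not original:
        raise ValueError("original sequence must be non-empty")
    # zip stops at the shorter sequence, so missing recalls count as misses.
    hits = sum(1 for o, r in zip(original, recalled) if o == r)
    return hits / len(original)


def summarize(scores):
    """Mean plus 5th and 95th percentiles of a list of accuracy scores."""
    # n=20 yields 19 cut points: the first is p05, the last is p95.
    # method="inclusive" keeps the quantiles within the observed range.
    cuts = statistics.quantiles(scores, n=20, method="inclusive")
    return {"mean": statistics.fmean(scores), "p05": cuts[0], "p95": cuts[-1]}


def simulated_recall(original, p_correct, rng):
    """Hypothetical stand-in for a model call.

    Each number is recalled correctly with probability p_correct;
    otherwise -1 is returned, which can never match a value in 0..99.
    """
    return [x if rng.random() < p_correct else -1 for x in original]


def run_condition(p_correct, trials, length, seed):
    """Run one experimental condition and return its per-trial accuracies."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        original = [rng.randint(0, 99) for _ in range(length)]
        recalled = simulated_recall(original, p_correct, rng)
        scores.append(reconstruction_accuracy(original, recalled))
    return scores


if __name__ == "__main__":
    # Placeholder settings: both conditions use the same p_correct,
    # so no effect of the explanation is implied by this sketch.
    without_explanation = run_condition(0.3, trials=200, length=10, seed=0)
    with_explanation = run_condition(0.3, trials=200, length=10, seed=1)
    print("without:", summarize(without_explanation))
    print("with:   ", summarize(with_explanation))
```

Reporting tail quantiles alongside the mean is a design choice: a change that widens the distribution in both directions can leave the mean nearly unchanged.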
Neighborhood — ranked by edge-count
Artifacts (1)
artifact
- Sauers' introspection in Claude post · implements · Twitter thread detailing the reconstruction experiment, its statistical analysis, and the effect of showing Janus's post.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Statistically rigorous analysis of Claude introspection; suggests models may have latent introspective capabilities that can be enhanced or disrupted.
- Metric of how well models reconstruct information from hidden states; Sauers' study found that showing Janus's thread extends the tails of the distribution.
- Methods for visualizing fungal networks in ants.
- Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
- The balance between how sparse and how faithful a decomposition is; VPD achieves a better tradeoff than transcoders.
- Giving models janus's thread extends reconstruction accuracy distribution tails in both directions · finding · 0.683 · Sauers' study: exposing models to janus's post extended both positive and negative extremes of reconstruction accuracy.
- Component of NLA that maps natural language explanations back to activations; truncated to first l layers of target model.
- Asks what underlying reality causes the consistent choices.