finding
active
finding:sae-reconstructions-on-llama-3-8b-layer-25-produce-intervened-emd-exceeding-the-natural-natural-baseline
SAE reconstructions on Llama-3-8B layer 25 produce intervened EMD exceeding the natural-natural baseline
Empirical demonstration that SAE projections produce divergent representations in a real LLM
Source paper
extracted_from (2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts
Neighborhood — ranked by edge-count
Claims (1)
claim
- Core empirical claim of the paper supported by both theoretical proof and empirical demonstration
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Empirical demonstration that MDVP produces divergent representations in a real LLM
- Core result of Experiment 2: deception feature suppression sharply increases experience claims
- Empirical demonstration that DAS interventions produce divergent representations
- Validates representational drift theory: later layers specialize for next-token prediction, increasing dr
- Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference (claim, similarity 0.752): Central interpretive claim of the paper supported by causal ablation and activation evidence
- LLaMA-3.1-8B: Sbmax = -1.896 ± 0.211, AUSN = -2.119 ± 0.198, peak layer ℓ* = 10 (median) (finding, similarity 0.752): Seed-pooled geometry-only statistics (per-dev z units).
- Goodfire blog post describing SAEs used for Llama models in this study
- Claim that feature grounding enables interpretability metrics.