finding

active

finding:sae-reconstructions-on-llama-3-8b-layer-25-produce-intervened-emd-exceeding-the-natural-natural-baseline

SAE reconstructions on Llama-3-8B layer 25 produce intervened EMD exceeding the natural-natural baseline

Empirical demonstration that SAE projections produce divergent representations in a real LLM
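A minimal sketch of the comparison this finding describes: the EMD between intervened and natural representations is measured against the EMD between two samples of natural representations. It is not the paper's implementation. The sliced (random-projection) approximation is swapped in for an exact EMD so the example needs only the standard library. The names `sliced_emd` and `sample` are hypothetical, and the Gaussian point clouds are synthetic stand-ins, not Llama-3-8B activations or SAE reconstructions.

```python
import math
import random


def sliced_emd(xs, ys, n_projections=64, seed=0):
    """Approximate EMD between two equal-sized point clouds.

    Projects both clouds onto random unit directions and averages the
    1-D EMD, which for equal-sized samples is the mean absolute
    difference between the sorted projections.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-sized samples")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction and normalise it to unit length.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction))
        direction = [v / norm for v in direction]
        px = sorted(sum(a * b for a, b in zip(x, direction)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, direction)) for y in ys)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections


def sample(rng, n, dim, shift=0.0):
    """Synthetic point cloud: n Gaussian vectors with mean `shift` per axis."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
natural_a = sample(rng, 50, 4)              # stand-in for natural representations
natural_b = sample(rng, 50, 4)              # a second natural sample
intervened = sample(rng, 50, 4, shift=5.0)  # stand-in for off-distribution outputs

baseline_emd = sliced_emd(natural_a, natural_b)      # natural-natural baseline
intervened_emd = sliced_emd(intervened, natural_a)   # intervened vs. natural

print("baseline:", baseline_emd)
print("intervened:", intervened_emd)
print("exceeds baseline:", intervened_emd > baseline_emd)
```

The natural-natural baseline captures how far apart two samples from the same distribution sit by chance alone, so an intervened EMD above it indicates divergence beyond sampling noise.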

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

Divergent representations are a common, if not likely, outcome of causal interventions across a wide range of methods
supports
Core empirical claim of the paper supported by both theoretical proof and empirical demonstration

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline · finding · 0.870
Empirical demonstration that MDVP produces divergent representations in a real LLM
Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶) · finding · 0.768
Core result of Experiment 2: deception feature suppression sharply increases experience claims
Boundless DAS interchange interventions produce EMD exceeding natural-natural baseline · finding · 0.763
Empirical demonstration that DAS interventions produce divergent representations
Systematic layer 20-28 degradation in S(ℓ) to S ≈ −2.40 by layer 27 on LLaMA · finding · 0.755
Validates representational drift theory: later layers specialize for next-token prediction, increasing dr
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.752
Central interpretive claim of the paper supported by causal ablation and activation evidence
LLaMA-3.1-8B: Sbmax = -1.896 ± 0.211, AUSN = -2.119 ± 0.198, peak layer ℓ* = 10 (median) · finding · 0.752
Seed-pooled geometry-only statistics (per-dev z units).
Announcing Open-Source SAEs for Llama 3.3 70B and Llama 3.1 8B (Balsam et al., 2025) · concept · 0.752
Goodfire blog post describing SAEs used for Llama models in this study
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.752
Claim that feature grounding enables interpretability metrics.