finding

active

finding:mean-difference-patching-on-llama-3-8b-layer-10-produces-intervened-emd-exceeding-the-natural-natural-baseline

Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline

Empirical demonstration that mean difference vector patching (MDVP) produces divergent representations in a real LLM

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

Divergent representations are a common, if not likely, outcome of causal interventions across a wide range of methods
supports
Core empirical claim of the paper supported by both theoretical proof and empirical demonstration

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE reconstructions on Llama-3-8B layer 25 produce intervened EMD exceeding the natural-natural baseline · finding · 0.870
Empirical demonstration that SAE projections produce divergent representations in a real LLM
Math and code tasks show strongest mid-layer anchoring on LLaMA (S ≈ −1.65 at layers 8-12) · finding · 0.790
Task-specific E3 finding showing compositional reasoning requires deeper processing
Patching group (b) hidden states (over clause-ending punctuation, early-middle layers) in LLaMA-2-13B produces the strongest causal effect on TRUE/FALSE output predictions · finding · 0.781
Localizes truth representations to specific hidden states, motivating the rest of the analysis
LLaMA-3.1-8B: Sbmax = −1.896 ± 0.211, AUSN = −2.119 ± 0.198, peak layer ℓ* = 10 (median) · finding · 0.780
Seed-pooled geometry-only statistics (per-dev z units).
Mean-difference patching in a two-layer ReLU circuit flips the decision to class-A by activating a third hidden unit that is silent for all natural class-A inputs · finding · 0.777
Synthetic theoretical example showing pernicious divergence via hidden pathway activation
Systematic layer 20-28 degradation in S(ℓ) to S ≈ −2.40 by layer 27 on LLaMA · finding · 0.773
Validates representational drift theory: later layers specialize for next-token prediction, increasing dr
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.771
Central interpretive claim of the paper supported by causal ablation and activation evidence
The case at approximately the 2/3 layer of LLaMA3.1-8B (Layer 24, satisfying Criteria 1 and 2) aligns with prior studies showing the 2/3 layer optimally predicts human brain activity. · finding · 0.770
Connects this study's results to Schrimpf et al. 2021 and Caucheteux et al. 2022/2023 findings on brain-LLM alignment.