finding

active

finding:mean-difference-patching-in-a-two-layer-relu-circuit-flips-the-decision-to-class-a-by-activating-a-third-hidden-unit-that-is-silent-for-all-natural-class-a-inputs

Mean-difference patching in a two-layer ReLU circuit flips the decision to class-A by activating a third hidden unit that is silent for all natural class-A inputs

Synthetic theoretical example showing pernicious divergence via hidden pathway activation
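
A minimal sketch of the kind of circuit this finding describes, under stated assumptions: the weights, the 2-d representation, and the four data points below are hypothetical values chosen for illustration, not the construction from the source paper. Units u1 and u2 stand in for the natural class-A mechanism. Unit u3 is silent on every natural input but fires on the mean-difference-patched representation.

```python
# Hypothetical toy circuit: all weights and data points are illustrative
# assumptions, not the values used in the source paper.

def relu(v):
    return max(0.0, v)

def hidden(h):
    """First layer: three ReLU units over a 2-d representation h = (h1, h2)."""
    h1, h2 = h
    u1 = relu(h1 - h2 - 1.0)  # fires on the class-A mode (2, 0)
    u2 = relu(h2 - h1 - 1.0)  # fires on the class-A mode (0, 2)
    u3 = relu(h1 + h2 - 3.0)  # silent on every natural input listed below
    return (u1, u2, u3)

def decide(h):
    """Second layer: predict class A iff 2 * (u1 + u2 + u3) - 1 > 0."""
    margin = 2.0 * sum(hidden(h)) - 1.0
    return "A" if margin > 0 else "B"

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

class_a = [(2.0, 0.0), (0.0, 2.0)]    # natural class-A representations
class_b = [(1.0, 1.0), (-1.0, -1.0)]  # natural class-B representations

# Mean-difference vector: mean(A) - mean(B).
mu_a, mu_b = mean(class_a), mean(class_b)
delta = (mu_a[0] - mu_b[0], mu_a[1] - mu_b[1])

# Patch one class-B representation by adding the mean-difference vector.
source = class_b[0]
patched = (source[0] + delta[0], source[1] + delta[1])

for name, h in [("a1", class_a[0]), ("a2", class_a[1]),
                ("b1", source), ("patched b1", patched)]:
    print(name, h, hidden(h), decide(h))
```

In this sketch the patched point (2, 2) is classified as A with u1 = u2 = 0 and u3 = 1, so the flip comes entirely from the unit that no natural class-A input activates. The other class-B point, (-1, -1), is not flipped by the same patch in this toy setup.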

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Claims (1)

claim

Off-manifold divergences can activate hidden pathways that produce misleadingly confirmatory behavior while the true mechanism is never exercised
supports
Core claim about why pernicious divergence undermines mechanistic conclusions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Mean difference patching on Llama-3-8B layer 10 produces intervened EMD exceeding the natural-natural baseline · finding · 0.777
Empirical demonstration that MDVP produces divergent representations in a real LLM
Patching group (b) hidden states (over clause-ending punctuation, early-middle layers) in LLaMA-2-13B produces the strongest causal effect on TRUE/FALSE output predictions · finding · 0.755
Localizes truth representations to specific hidden states, motivating the rest of the analysis
Two-layer attention-only transformers implement much more complex algorithms via composition of attention heads, detectable directly from weights · claim · 0.752
Core claim for two-layer models; composition creates qualitatively more powerful in-context learning
Learned checkerboard generation circuit reduces to just 5 active logic gates after pruning (6 with one redundant AND) · finding · 0.750
Remarkably minimal circuit discovered for checkerboard pattern generation
Circuits could act as an epistemic foundation for interpretability by breaking down model behavior into falsifiable statements about small subgraphs · claim · 0.746
Normative vision for how the circuits agenda could resolve the pre-paradigmatic state of interpretability
The sensitivity to think/don't think instructions may be achieved via a circuit that tags tokens as attention-worthy based on instructions or incentives · hypothesis · 0.742
Mechanism for how the model modulates representation strength.
At layer 0 with α=5, detection-adjusted logit difference is +3.19 and control increase is +3.22, a difference of only 0.03 logits · finding · 0.741
Concrete numerical example showing detection and control are nearly identical at peak apparent accuracy
InceptionV1 implements a four-layer circuit for pose-invariant dog head detection with mirrored left/right pathways that inhibit each other then unite, exhibiting XOR-like properties · finding · 0.741
Evidence that neural networks learn sophisticated invariance mechanisms through structured circuits rather than loose feature aggregation