quote

active


Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior.

Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1
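
The mechanism in the quote can be illustrated with a toy sketch. This is not the paper's setup: the two-layer network, its weights, and every name below (`layer1`, `layer2`, `readout`, `run`) are hypothetical and chosen only so that natural hidden states always satisfy h1[0] == h1[1]. Patching a single coordinate from a source run produces a hidden state that never occurs naturally, which switches on a unit that is dormant on every natural input. The patched output still matches the source output, so the intervention looks confirmatory even though the pathway that normally carries the behavior was never used.

```python
def relu(v):
    return max(0.0, v)

def layer1(x):
    # Hypothetical encoder: natural hidden states lie on the diagonal h1[0] == h1[1].
    return [float(x), float(x)]

def layer2(h1):
    a = relu(h1[0] + h1[1] - 1.0)  # pathway used on natural inputs
    b = relu(h1[0] - h1[1] - 0.5)  # hidden pathway: zero whenever h1[0] == h1[1]
    return a, b

def readout(a, b):
    return a + 2.0 * b

def run(x, patch=None):
    h1 = layer1(x)
    if patch is not None:
        h1 = patch(h1)  # intervene on the hidden state before layer 2
    a, b = layer2(h1)
    return {"h1": h1, "a": a, "b": b, "out": readout(a, b)}

base = run(0)    # natural run, class 0
source = run(1)  # natural run, class 1

# Patch only coordinate 0 from the source run into the base run.
# The result [1.0, 0.0] is off the natural manifold.
patched = run(0, patch=lambda h1: [source["h1"][0], h1[1]])

for name, result in [("base", base), ("source", source), ("patched", patched)]:
    print(name, result)
```

In this sketch the patched run reproduces the source output while unit `a` stays at zero and unit `b`, silent on both natural runs, supplies the whole effect. Comparing intermediate activations against their natural range, and not only the final output, is one way such a divergence could be noticed.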

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Off-manifold divergences can activate hidden pathways that produce misleadingly confirmatory behavior while the true mechanism is never exercised · claim · 0.842
Core claim about why pernicious divergence undermines mechanistic conclusions
probably helps not only with faithful reconstruction but also creates interference patterns that encode nuanced information about the deltas and convergences between states. · quote · 0.806
Key quote connecting path redundancy to interferometric information encoding.
Divergent representations are a common, if not likely, outcome of causal interventions across a wide range of methods · claim · 0.789
Core empirical claim of the paper supported by both theoretical proof and empirical demonstration
Patching group (b) hidden states (over clause-ending punctuation, early-middle layers) in LLaMA-2-13B produces the strongest causal effect on TRUE/FALSE output predictions · finding · 0.786
Localizes truth representations to specific hidden states, motivating the rest of the analysis
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect · hypothesis · 0.784
Extrapolation from scale-emergence finding to future risk
Larger hidden representations create more random structure that DAS can search through, allowing manipulation of counterfactual behavior even in randomly initialized networks · hypothesis · 0.781
Tested in Section 4.4 calibration experiment; confirmed by findings.
Minimizing divergence magnitude does not guarantee elimination of hidden pathways; it only reduces the risk surface · claim · 0.780
Important caveat to the CL loss solution, noting that it is a step, not a complete fix
When it is not okay, how can we prevent divergent representations from occurring? · question · 0.777
Third core research question motivating the CL loss approach in Section 5