quote
active

Patching h[1] with a divergent representation can activate distinct, hidden pathways that result in misleadingly confirmatory behavior and/or undetected behavior.

Load-bearing description of the core pernicious divergence mechanism illustrated in Figure 1 of the source paper.
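The mechanism in the quote can be sketched with a toy model. This is not the setup from the paper's Figure 1; the functions `hidden`, `readout`, and `run`, the threshold of 2.0, and the gain of 5.0 are all hypothetical choices made for illustration. The sketch assumes natural inputs keep h[1] inside [0, 1], so a unit that only fires when h[1] exceeds 2 never activates unless a patch pushes h[1] off-distribution.

```python
def relu(v):
    return max(0.0, v)

def hidden(x):
    # Hypothetical encoder: natural inputs lie in [0, 1], so h[0] and h[1] do too.
    return [x[0], x[1]]

def readout(h):
    natural = h[0]               # pathway the model uses on natural inputs
    dormant = relu(h[1] - 2.0)   # hidden pathway, silent whenever h[1] <= 2
    return natural + 5.0 * dormant

def run(x, patch_h1=None):
    h = hidden(x)
    if patch_h1 is not None:
        h[1] = patch_h1          # the intervention: overwrite h[1]
    return readout(h)

x = [0.2, 0.1]
base = run(x)                        # no intervention
in_dist = run(x, patch_h1=0.9)       # patch with a value natural inputs can produce
divergent = run(x, patch_h1=3.0)     # patch with an off-distribution value

print("base:", base)
print("in-distribution patch:", in_dist)
print("divergent patch:", divergent)
```

In this toy, the in-distribution patch leaves the output unchanged, while the divergent patch changes it by switching on the dormant unit. An experimenter who only ran the divergent patch could conclude that h[1] drives the output, even though no natural input ever uses that pathway.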

Source paper

extracted_from
Addressing divergent representations from causal interventions on neural networks
(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.