hypothesis
active
The And-Or algorithm may not be a true abstraction of the trained MLP's behaviour, since it never achieves high interchange intervention accuracy (IIA) in later layers regardless of alignment map complexity
Hypothesis raised in the analysis of the distributive law task
Source paper
extracted_from (2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrates that high IIA can be obtained even when model cannot solve the task
- Authors connect their finding to the prior probing literature debate
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
- Claim about current practical feasibility and efficiency of 2-way associative implementations
- Initial evidence that alignment faking persona is more sensitive to exploiting training signals
- Key limitation of the paper's approach; MLP layers make up 2/3 of standard transformer parameters
- Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
- Corroborating result on additional task confirming main paper findings