claim
active
claim:near-perfect-iia-can-be-achieved-on-randomly-initialised-models-that-cannot-solve-the-task-suggesting-causal-alignment-does-not-require-task-capability
Near-perfect IIA can be achieved on randomly initialised models that cannot solve the task, suggesting causal alignment does not require task capability
Empirical support for vacuousness of unrestricted causal abstraction
Source paper
extracted_from · (2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago
Neighborhood — ranked by edge-count
Findings (2)
finding
- Training progression result showing non-linear maps are uncorrelated with genuine task learning
- Demonstrates that high IIA can be obtained even when model cannot solve the task
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Authors connect their finding to the prior probing literature debate
- Demonstrates the value of the CL auxiliary loss for recovering causal alignments when one model cannot be intervened upon.
- Extrapolation from scale-emergence finding to future risk
- Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals · hypothesis · 0.773 · Central forward-looking hypothesis of the paper motivating the research
- Authors identify this as the most uncertain and important question for future work
- Future work hypothesis about extending SOO to direct value alignment
- Central motivating claim of the paper; supported by empirical comparisons showing RSA/CKA miss Markovian differences detectable by MAS.
- Rules out prompt-level implicit priming for alignment faking independent of query content