claim

active

claim:near-perfect-iia-can-be-achieved-on-randomly-initialised-models-that-cannot-solve-the-task-suggesting-causal-alignment-does-not-require-task-capability

Near-perfect IIA can be achieved on randomly initialised models that cannot solve the task, suggesting causal alignment does not require task capability

Empirical support for the vacuousness of unrestricted causal abstraction

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Findings (2)

finding

An 8-layer ϕ_nonlin achieves near-perfect IIA on Pythia-410m on the IOI task at all training steps, including random initialisation
supports
Training-progression result showing that IIA under non-linear maps is uncorrelated with genuine task learning
Over 80% IIA is achieved using complex non-linear alignment maps on randomly initialised MLPs in the hierarchical equality task
supports
Demonstrates that high IIA can be obtained even when the model cannot solve the task
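
For readers unfamiliar with the metric, the following is a minimal sketch of how interchange intervention accuracy (IIA) is typically computed. It is not the source paper's implementation: the toy two-dimensional model, the rotation R, and every function name here are hypothetical and chosen only for illustration. The alignment map phi is assumed to be invertible, and aligned_dims is assumed to mark the coordinates of phi(h) that correspond to the high-level variable V.

```python
import numpy as np

def interchange_intervention_accuracy(encode, decode, phi, phi_inv,
                                      aligned_dims, high_level, bases, sources):
    """Share of (base, source) pairs on which the patched low-level output
    agrees with the high-level model under the matching intervention."""
    hits = 0
    for b, s in zip(bases, sources):
        z_base = phi(encode(b)).copy()    # base hidden state, aligned coordinates
        z_src = phi(encode(s))            # source hidden state, aligned coordinates
        z_base[aligned_dims] = z_src[aligned_dims]   # interchange intervention
        low_out = decode(phi_inv(z_base))            # finish the low-level pass
        hits += int(low_out == high_level(b, s))
    return hits / len(bases)

# Toy low-level model (hypothetical): the hidden state is a rotated copy of the input.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
encode = lambda x: R @ x
# Inputs are integers, so a 0.5 threshold is robust to floating-point error.
decode = lambda h: int((R.T @ h).sum() > 0.5)

# High-level model: output is 1 iff V + x[1] >= 1, where V = x[0].
# Under the intervention, V takes its value from the source input.
high_level = lambda b, s: int(s[0] + b[1] >= 1)

rng = np.random.default_rng(0)
bases = rng.integers(-5, 6, size=(200, 2))
sources = rng.integers(-5, 6, size=(200, 2))

# Alignment map that undoes the rotation, so coordinate 0 carries V.
score = interchange_intervention_accuracy(
    encode, decode, lambda h: R.T @ h, lambda z: R @ z,
    [0], high_level, bases, sources)

# Identity map for comparison: coordinate 0 now mixes both inputs.
baseline = interchange_intervention_accuracy(
    encode, decode, lambda h: h, lambda z: z,
    [0], high_level, bases, sources)

print(score, baseline)
```

In this toy, score is 1.0 because phi exactly undoes the rotation, so swapping coordinate 0 swaps V and nothing else. The claim on this page concerns the opposite regime: when phi is allowed to be a sufficiently expressive non-linear map, the same metric can be driven high even on a network that cannot solve the task.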

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off · claim · 0.812
Authors connect their finding to the prior probing literature debate
CLMAS achieves the best IIA in the causally inaccessible (No Access) direction while matching MAS in the accessible direction · finding · 0.789
Demonstrates the value of the CL auxiliary loss for recovering causal alignments when one model cannot be intervened upon.
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect · hypothesis · 0.779
Extrapolation from scale-emergence finding to future risk
Future more capable AI systems are at risk of alignment faking, whether for benign or malicious goals · hypothesis · 0.773
Central forward-looking hypothesis of the paper motivating the research
Will future AI systems naturally develop the key elements (strong conflicting preferences, situational awareness) necessary for dangerous alignment faking? · question · 0.761
Authors identify this as the most uncertain and important question for future work
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.757
Future work hypothesis about extending SOO to direct value alignment
Correlative methods like RSA and CKA are insufficient for determining functional similarity between neural systems; causal methods are necessary · claim · 0.753
Central motivating claim of the paper; supported by empirical comparisons showing RSA/CKA miss Markovian differences detectable by MAS.
Alignment-faking reasoning is almost absent (~2%) when evaluating the model on benign unrelated queries with the harmful-only prompt · finding · 0.753
Rules out prompt-level implicit priming for alignment faking independent of query content