hypothesis

active

hypothesis:we-hypothesized-that-divergence-could-influence-iia-when-transferring-the-das-alignment-to-ood-settings

We hypothesized that divergence could influence IIA when transferring the DAS alignment to OOD settings

Motivating hypothesis for the OOD experiment testing practical utility of divergence reduction

Source paper

extracted_from

Addressing divergent representations from causal interventions on neural networks

(2025) · Satchel Grant · Simon Jerome Han · Alexa R. Tartaglini · Christopher Potts

Neighborhood — ranked by edge-count

Findings (2)

finding

Linear regression of OOD IIA on training EMD yields coefficient -0.3424, R^2=0.729, F(1,28)=75.28, p<.001
associated_with · supports
Statistical evidence that training divergence (EMD) predicts lower OOD intervention performance

Modified CL loss outperforms behavioral DAS loss in OOD transfer from dense to sparse class partition
supports
Key practical utility result: CL loss improves generalization of alignment to out-of-distribution settings
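
The regression statistics reported in the first finding can be cross-checked against each other. The sketch below is not taken from the source paper; it only uses the standard identity linking F and R^2 in a linear regression, and it assumes the model has an intercept and a single predictor (training EMD), so the residual degrees of freedom of 28 would imply 30 observations. The function names are hypothetical.

```python
import math


def f_from_r2(r2: float, df_model: int, df_resid: int) -> float:
    """F statistic implied by R^2 for a linear regression with an intercept."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)


def r2_from_f(f_stat: float, df_model: int, df_resid: int) -> float:
    """Inverse of f_from_r2: the R^2 implied by an F statistic."""
    return (f_stat * df_model) / (f_stat * df_model + df_resid)


# Values as reported in the finding above.
reported_r2 = 0.729
reported_f = 75.28
df_model, df_resid = 1, 28

# Assumption: intercept plus one slope, so n = df_model + df_resid + 1.
n_obs = df_model + df_resid + 1

implied_f = f_from_r2(reported_r2, df_model, df_resid)
implied_r2 = r2_from_f(reported_f, df_model, df_resid)

# With one predictor, the correlation has the sign of the slope (negative here).
implied_r = -math.sqrt(reported_r2)

print(f"n = {n_obs}")
print(f"F implied by R^2:  {implied_f:.2f}")
print(f"R^2 implied by F:  {implied_r2:.3f}")
print(f"correlation r:     {implied_r:.3f}")
```

The F value implied by the rounded R^2 differs from the reported F only in the second decimal place, which is what rounding R^2 to three decimals would be expected to produce.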

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off · claim · 0.772
Authors connect their finding to the prior probing literature debate
DAS finds better alignments than brute-force search by using gradient descent rather than exhaustive discrete search · claim · 0.760
Second central claim of the paper.
Near-perfect IIA can be achieved on randomly initialised models that cannot solve the task, suggesting causal alignment does not require task capability · claim · 0.752
Empirical support for vacuousness of unrestricted causal abstraction
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.751
Future work hypothesis about extending SOO to direct value alignment
DAS achieves substantial causal effect even on arbitrary input-output mappings where no causal mechanism should exist · finding · 0.746
Replication of Wu et al. 2023 finding; DAS expressivity concern validated in CausalGym setup
Will future AI systems naturally develop the key elements (strong conflicting preferences, situational awareness) necessary for dangerous alignment faking? · question · 0.742
Authors identify this as the most uncertain and important question for future work
Representational divergence (as measured by EMD) can predict lower out-of-distribution intervention performance · claim · 0.739
Practical utility of reducing divergence demonstrated through regression analysis
DAS overcomes the localist limitation of prior causal abstraction by allowing individual neurons to play multiple roles via non-standard bases · claim · 0.736
Central claim motivating DAS over prior methods.