claim

active

claim:including-within-model-interventions-i-j-in-mas-training-adds-a-soft-constraint-encouraging-separation-of-causal-from-extraneous-subspaces

Including within-model interventions (i=j) in MAS training adds a soft constraint encouraging separation of causal from extraneous subspaces

Methodological claim about why within-model interchange interventions are essential to the MAS training procedure.

Source paper

extracted_from

Model Alignment Search

(2025) · Satchel Grant

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Using more than two models in a MAS comparison could harm alignment due to conflicting loss gradients, or could assist in isolating causal subspaces · hypothesis · 0.778
Open question raised in the paper about scaling MAS beyond two models.
MAS is a more causally focused choice than model stitching for addressing questions of how behaviorally relevant information is encoded in different neural systems · claim · 0.777
Core interpretive claim supported by the formal analysis showing MAS does not exploit the behavioral null space unlike stitching.
DAS finds causal effect at all training timesteps, including when the model is just initialised · finding · 0.765
Corroborates the Wu et al. (2023) finding that DAS expressivity inflates causal effect estimates.
There are fewer representations competent for N tasks than for M < N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.764
Selective pressure toward convergence via task generality.
Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage · claim · 0.762
Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions.
Interventions along activation manifold M_h yield behavioral trajectories following behavior manifold M_y, and vice versa; this bidirectional relationship is demonstrated across language models and video world models · finding · 0.760
Central empirical result showing causal coupling between representation and behavior geometry across multiple substrates and modalities.
Model stitching can use the behavioral null space of the source model when mapping to the target, making successful stitching insufficient evidence of representational similarity · claim · 0.756
Formal analysis showing the theoretical limitation of model stitching as a similarity measure.
MAS-like methods could potentially be used to directly constrain model internals to be non-toxic · claim · 0.756
Speculative forward-looking claim about practical applications of MAS for model alignment.