concept
active
concept:alignment-between-high-level-and-low-level-models
Alignment Between High-Level and Low-Level Models
A mapping that assigns to each high-level variable a set of low-level variables, together with a function from the values of those low-level variables to the value of the high-level variable.
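The definition above can be made concrete with a small data structure. This is a minimal sketch under stated assumptions, not the source's implementation: the names `VariableAlignment`, `abstract_state`, `is_positive`, `magnitude`, and the low-level variable names `h0`, `h1`, `h2` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class VariableAlignment:
    """Alignment entry for one high-level variable (hypothetical structure)."""
    low_level_vars: Tuple[str, ...]                 # the assigned set of low-level variables
    to_high_level: Callable[[Sequence[float]], Any]  # low-level values -> high-level value


def abstract_state(
    alignment: Dict[str, VariableAlignment],
    low_level_state: Dict[str, float],
) -> Dict[str, Any]:
    """Compute each high-level value from the low-level variables assigned to it."""
    return {
        name: entry.to_high_level([low_level_state[v] for v in entry.low_level_vars])
        for name, entry in alignment.items()
    }


# Toy alignment: two high-level variables over three low-level variables (all assumed).
alignment = {
    "is_positive": VariableAlignment(("h0", "h1"), lambda vals: sum(vals) > 0),
    "magnitude": VariableAlignment(("h2",), lambda vals: round(abs(vals[0]))),
}

low_state = {"h0": 0.7, "h1": -0.2, "h2": -3.1}
high_state = abstract_state(alignment, low_state)
print(high_state)
```

Here each high-level variable reads only from its own set of low-level variables, so the alignment is fully described by the pair (variable set, value function) stored per entry.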
Neighborhood — ranked by edge-count
Methods (1)
method
- The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
Concepts (1)
concept
- Acyclic Causal Model (associated_with): Consists of input, intermediate, and output variables with associated causal mechanisms; the mathematical object central to DAS.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- Scale-dependent structural finding from PCA visualizations in §4
- Authors' interpretation of the prompt-variation results: alignment faking disappears only when the conflicting objective is removed
- Core cross-modal empirical result: larger and better language models align better with vision models
- Open methodological question acknowledged as limitation
- The phenomenon of model internals deviating from desired behavior; MAS is demonstrated to detect this via comparison of toxic vs nontoxic LLMs.