concept
active
concept:variational-family-v-for-alignment-maps
Variational Family V for Alignment Maps
A generalised notion that restricts alignment maps to a family V; linearity is one special case.
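As a rough illustration of how a family V can contain the kinds of maps named in this entry's neighborhood (identity, orthogonal linear, and reversible non-linear), the Python sketch below builds one toy member of each in two dimensions and checks that each can be inverted. The function names, the 2-D setting, the rotation angle, and the tanh coupling weights are all assumptions made for illustration and are not taken from the source. The non-linear member is a single additive-coupling step standing in for a reversible residual block.

```python
import math

def identity_map():
    """phi(h) = h: the simplest member."""
    return (lambda h: h), (lambda z: z)

def rotation_map(theta):
    """phi(h) = W h, with W a 2x2 rotation (a simple orthogonal matrix)."""
    c, s = math.cos(theta), math.sin(theta)
    forward = lambda h: (c * h[0] - s * h[1], s * h[0] + c * h[1])
    # The inverse of an orthogonal matrix is its transpose.
    inverse = lambda z: (c * z[0] + s * z[1], -s * z[0] + c * z[1])
    return forward, inverse

def coupling_map(a, b):
    """One additive-coupling step (hypothetical stand-in for a reversible block)."""
    def forward(h):
        y1 = h[0] + math.tanh(a * h[1])
        y2 = h[1] + math.tanh(b * y1)
        return (y1, y2)
    def inverse(z):
        # Undo the two additive updates in reverse order.
        h2 = z[1] - math.tanh(b * z[0])
        h1 = z[0] - math.tanh(a * h2)
        return (h1, h2)
    return forward, inverse

# A toy "family V": each entry is a (forward, inverse) pair.
family_v = {
    "identity": identity_map(),
    "orthogonal": rotation_map(0.7),
    "coupling": coupling_map(1.3, -0.8),
}

h = (0.5, -1.25)  # hypothetical hidden vector
round_trip_error = {
    name: max(abs(x - y) for x, y in zip(inv(fwd(h)), h))
    for name, (fwd, inv) in family_v.items()
}
```

In this sketch, restricting V to `rotation_map` alone recovers a purely linear setting, and `rotation_map(0.0)` reduces to the identity map, which is one way to read the statement that linearity is a special case of the broader family.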
Neighborhood — ranked by edge-count
Concepts (1)
- Alignment Map (ϕ) · extends · The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Simplest alignment map ϕ(h)=h, equivalent to assuming privileged bases hypothesis
- Alignment map implemented as a reversible residual network (RevNet); assumes non-linear representation hypothesis
- The goal of making model behavior match human values and intentions, often addressed during post-training.
- Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
- Measure of similarity between the similarity structures (kernels) induced by two different representations
- Alignment map ϕ(h)=W_orth*h using orthogonal matrix; assumes linear representation hypothesis
- The vast variety of shapes and sizes in morphogenetic living forms, impossible under blueprint planning.
- The only statistically significant predictor of koan battery scores (p=0.006); includes Constitutional AI, RLHF, SFT, roleplay, empathy