method
active
method:linear-alignment-map-lin
Linear Alignment Map (ϕ_lin)
Alignment map ϕ(h) = W_orth h, where W_orth is an orthogonal matrix; assumes the linear representation hypothesis.
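To make the definition concrete, here is a minimal NumPy sketch of an orthogonal linear map applied to hidden states. It is an illustration only: the dimension, the random initialisation via QR, and the names `random_orthogonal`, `phi_lin`, and `phi_lin_inverse` are assumptions, not taken from the source, and in practice W_orth would presumably be learned rather than sampled.

```python
import numpy as np

def random_orthogonal(d, seed=0):
    """Sample a d x d orthogonal matrix from the QR factorization of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs using diag(r); q stays orthogonal.
    return q * np.sign(np.diag(r))

def phi_lin(h, w_orth):
    """Apply phi(h) = W_orth h to each row of h (shape: batch x d)."""
    return h @ w_orth.T

def phi_lin_inverse(z, w_orth):
    """Invert the map; for an orthogonal matrix the inverse is the transpose."""
    return z @ w_orth

d = 8  # hypothetical hidden size
w_orth = random_orthogonal(d)
h = np.random.default_rng(1).standard_normal((4, d))  # stand-in hidden states
z = phi_lin(h, w_orth)
h_back = phi_lin_inverse(z, w_orth)
```

Because W_orth is orthogonal, the map is a rotation (possibly with a reflection) of the representation space: it preserves vector norms and is exactly invertible by the transpose.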
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (1)
framework
- Linear Representation Hypothesis · implements · The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
Methods (1)
method
- Non-Linear Alignment Map (ϕ_nonlin) · related_to · Alignment map implemented as a reversible residual network (RevNet); assumes non-linear representation hypothesis
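For contrast with the linear map, the related non-linear map is described as a reversible residual network. The sketch below shows one additive-coupling reversible block in the standard RevNet style, which is invertible by construction. The sub-networks, sizes, and function names (`make_mlp`, `rev_block_forward`, `rev_block_inverse`) are hypothetical; the source does not specify the architecture details.

```python
import numpy as np

def make_mlp(d_in, d_out, hidden, rng):
    """Return a small fixed two-layer tanh MLP (weights are random placeholders)."""
    w1 = rng.standard_normal((d_in, hidden)) / np.sqrt(d_in)
    w2 = rng.standard_normal((hidden, d_out)) / np.sqrt(hidden)
    return lambda x: np.tanh(x @ w1) @ w2

def rev_block_forward(h, f, g):
    """Additive coupling: split h in two halves and update each from the other."""
    x1, x2 = np.split(h, 2, axis=-1)
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return np.concatenate([y1, y2], axis=-1)

def rev_block_inverse(z, f, g):
    """Undo the coupling exactly by subtracting in reverse order."""
    y1, y2 = np.split(z, 2, axis=-1)
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return np.concatenate([x1, x2], axis=-1)

rng = np.random.default_rng(0)
d = 8  # hypothetical hidden size; must be even so it splits into two halves
f = make_mlp(d // 2, d // 2, 16, rng)
g = make_mlp(d // 2, d // 2, 16, rng)
h = rng.standard_normal((4, d))  # stand-in hidden states
z = rev_block_forward(h, f, g)
h_rec = rev_block_inverse(z, f, g)
```

The inverse never needs to invert `f` or `g` themselves, so the block remains a bijection however expressive those sub-networks are.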
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied
- Semantic domain for linear transformations; denotation as actual linear function; Category instance generated from homomorphism principle.
- Key empirical result: non-linear maps overcome linear maps' failure in deeper layers
- Replicates Geiger et al. 2024b pattern of layer-dependent IIA degradation with linear maps
- Simplest alignment map ϕ(h)=h, equivalent to assuming privileged bases hypothesis
- A straight vector in activation space, traditionally used for concept manipulation; claimed to be insufficient when true concept geometry is curved.
- Linear alignment map ϕ_lin IIA tracks DNN accuracy during Pythia-410m training progression on IOI task · finding · 0.773 · Suggests linear maps may be better correlated with genuine task implementation than non-linear maps
- The goal of making model behavior match human values and intentions, often addressed during post-training.