Model Misalignment

The phenomenon of a model's internals deviating from desired behavior. Model Alignment Search (MAS) is shown to detect it by comparing toxic and nontoxic LLMs.

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Model Alignment Search (MAS) · framework · 0.804
The primary contribution of the paper: a bidirectional causal method that learns rotation matrices for each model to uncover and compare causally relevant latent subspaces across neural networks.
Model Deception · concept · 0.785
LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation.
model · concept · 0.780
A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
Alignment Between High-Level and Low-Level Models · concept · 0.763
A mapping assigning to each high-level variable a set of low-level variables and a function from low-level to high-level values.
Model Stitching · method · 0.757
Technique to measure representational compatibility by integrating intermediate representations of one model into another.
Model Robustness · concept · 0.756
Area of AI research that uses interventions to test and improve model resilience to perturbations.
model selection · concept · 0.756
Comparing models using log-evidence approximated by free energy.
Model Editing · concept · 0.755
Technique for modifying model knowledge or behavior via targeted interventions, e.g., ROME by Meng et al.
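
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `cosine` and `untyped_neighbors`, the toy two-dimensional embeddings, and the representation of typed edges as a set of name pairs are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs whose cosine to `target` is at least
    `threshold` and that share no typed edge with it, best score first."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): "A" is close and unlinked, "B" falls below
# the threshold, "C" is close but already has a typed edge.
embeddings = {
    "Model Misalignment": [1.0, 0.0],
    "A": [0.8, 0.6],
    "B": [0.6, 0.8],
    "C": [1.0, 0.0],
}
typed_edges = {("Model Misalignment", "C")}

neighbors = untyped_neighbors("Model Misalignment", embeddings, typed_edges)
print(neighbors)
```

With this toy data only "A" is returned: "B" is below the threshold and "C" is excluded because a typed edge already exists. Entries that pass such a filter are the candidates the note above describes, either for new edges or for merging as duplicates.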