Model Robustness

Area of AI research that uses interventions to test and improve a model's resilience to perturbations.

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

model · concept · 0.766
A representation that captures relevant aspects of a system; according to the theorem, the regulator must embody this.
model selection · concept · 0.764
Comparing models using log-evidence approximated by free energy.
Models that are competent all represent data in a similar way; all strong models are alike, each weak model is weak in its own way · claim · 0.761
Author's interpretation of the VTAB alignment results echoing Tolstoy.
Model Evidence · concept · 0.758
Probability of data under the model, penalizing complexity and rewarding accuracy.
Model Misalignment · concept · 0.756
The phenomenon of model internals deviating from desired behavior; MAS is demonstrated to detect this via comparison of toxic vs nontoxic LLMs.
Preference Model · framework · 0.748
A model trained on comparison data to assign scores to responses, used as reward signal in RLHF/RLAIF.
Autoregressive models · framework · 0.742
Second model system studied; used to show why flat autoregressive LLMs struggle with long-range coherence.
hierarchical models · method · 0.742
Models of sensory generation that allow dynamic context-sensitive prior expectations.
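
The selection rule behind this listing (cosine ≥ 0.65, no typed edge to the current entity) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the toy two-dimensional embeddings, and the representation of typed edges as unordered pairs are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are close to `target`
    in embedding space but share no typed edge with it, highest score first.

    `embeddings` maps entity name -> vector (hypothetical data layout).
    `typed_edges` is a set of frozenset({a, b}) pairs (hypothetical layout).
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for the example.
toy_embeddings = {
    "target": [1.0, 0.0],
    "near_unlinked": [0.8, 0.6],   # similarity about 0.8: kept
    "below_cutoff": [0.6, 0.8],    # similarity about 0.6: dropped
    "near_linked": [1.0, 0.1],     # similar, but already has a typed edge
}
toy_edges = {frozenset(("target", "near_linked"))}

candidates = untyped_neighbors("target", toy_embeddings, toy_edges)
```

With the toy data, only `near_unlinked` survives both filters. The 0.65 default mirrors the cutoff shown in the listing header; whether the real system ranks by score in exactly this way is not stated on this page.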