hypothesis

active

hypothesis:the-effect-size-of-clmas-improvement-over-baselines-will-correlate-with-the-amount-of-variability-in-the-behavioral-null-space-of-the-inaccessible-model

The effect size of CLMAS improvement over baselines will correlate with the amount of variability in the behavioral null space of the inaccessible model

Prediction about when CLMAS will be most beneficial, stated explicitly in the paper.

Source paper

extracted_from

Model Alignment Search

(2025) · Satchel Grant

Neighborhood — ranked by edge-count

Findings (1)

finding

CLMAS achieves the best IIA in the causally inaccessible (No Access) direction while matching MAS in the accessible direction
associated_with
Demonstrates the value of the CL auxiliary loss for recovering causal alignments when one model cannot be intervened upon.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Larger models should amplify bias less than smaller models, with model biases more accurately reflecting data biases rather than exacerbating them · claim · 0.764
Implication of PRH for AI fairness and bias
The earlier a base model (less exposure to LM-related data), the more it is surprised by its own spontaneous self-referential capabilities. · claim · 0.761
Claim that capability emerges from architecture, not data, and that later models lose the surprise.
MCA creates a more tractable search space by smoothing the fitness landscape, mitigating the inverse problem and enabling survival of mutations that would otherwise be deleterious. · claim · 0.758
Subclaim.
Results may not fully generalize to all models and scenarios because the model organism relies on hints and nudges, and Llama Nemotron cannot consistently distinguish evaluation from deployment based on subtle cues · claim · 0.753
Key limitation acknowledged by authors.
Meta-prompt ESR enhancement effects scale with model size across Llama and Gemma families · finding · 0.747
Suggests underlying self-monitoring circuits must be present for meta-prompting to enhance them
CLMAS can potentially reduce or remove the need for NN stimulation during alignment training in biological settings · claim · 0.745
Forward-looking claim about the practical utility of CLMAS for ANN-BNN comparisons with limited causal access.
For small models, critiqued revisions yield higher harmlessness PM scores than direct revisions; for large models the difference is negligible. · finding · 0.744
Figure 7 comparison of critiqued vs direct revisions across model sizes.
Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions · finding · 0.744
Core result showing MM is superior to LR for causal implication despite similar classification accuracy
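
The "Related by similarity" listing above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the dictionary of embeddings, and the set of typed edges are hypothetical and do not describe how this page actually computes its neighborhood. Only the 0.65 threshold and the "no typed edge" filter come from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, highest score first.

    embeddings  -- hypothetical dict: entity id -> embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) pairs
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): "c" is similar but already has a typed edge, "b" is dissimilar.
toy_embeddings = {"h": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.2]}
toy_edges = {frozenset(("h", "c"))}
related = similar_without_edge("h", toy_embeddings, toy_edges)
```

In this toy example only "a" is returned for "h": "b" falls below the threshold and "c" is excluded because it already has a typed edge.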