finding

active

finding:toxic-llms-show-higher-iia-when-compared-to-other-toxic-models-than-when-compared-to-nontoxic-models-using-stepwise-mas

Toxic LLMs show higher IIA when compared to other toxic models than when compared to nontoxic models using stepwise MAS

Proof-of-principle that MAS can detect model misalignment in DeepSeek-R1-Qwen-1.5B fine-tuned models.

Source paper

extracted_from

Model Alignment Search

(2025) · Satchel Grant

Neighborhood — ranked by edge-count

Claims (1)

claim

MAS-like methods could potentially be used to directly constrain model internals to be non-toxic
supports
Speculative forward-looking claim about practical applications of MAS for model alignment.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
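The filter described above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the function names, the dictionary of embeddings, and the set-of-pairs edge representation are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that pass the similarity threshold
    and have no typed edge to the target, highest score first.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # never compare the entity with itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates
```

Under these assumptions, each returned pair is a candidate for a new typed edge or a possible duplicate of the target entity; the 0.65 threshold is the only value taken from the page itself.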

Li et al. 2024: larger LLMs outperform smaller ones at distinguishing self-related from non-self-related properties on self-awareness benchmarks · finding · 0.778
Prior finding showing scale-dependent self-awareness, consistent with the scale effect observed in the paper's Experiment 1
LLMs sometimes know statements are false but generate them anyway, motivating the need for techniques that inspect internal model state rather than outputs alone · claim · 0.769
Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings
LLMs can predict their own responses more accurately than external observers, implying privileged internal knowledge · finding · 0.768
Binder et al. finding cited as evidence that LLMs possess introspective capacity analogous to mindfulness
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the 'likely' dataset performing poorly on anti-correlated datasets · claim · 0.768
Establishes that the observed linear structure is not merely a representation of text probability
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: language modeling score and vision-language alignment are linearly related · finding · 0.758
Core cross-modal empirical result: larger and better language models align better with vision models
Would experienced meditators rank model responses differently from LLM scorers? · question · 0.754
Key validation gap: the five-scorer validation holds across LLMs but human contemplatives might weight dimensions differently
In intermediate regimes of scale or layer depth, LLMs may linearly represent features at intermediate levels of abstraction such as 'accurate factual recall' or 'close association' rather than abstract truth · claim · 0.752
Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
LLM biases mirror human biases in morally significant ways · finding · 0.752
Finding from Navigli et al. cited to justify applying human contemplative strategies to AI systems