finding
active
finding:toxic-llms-show-higher-iia-when-compared-to-other-toxic-models-than-when-compared-to-nontoxic-models-using-stepwise-mas
Toxic LLMs show higher IIA when compared to other toxic models than when compared to nontoxic models using stepwise MAS
Proof-of-principle that MAS can detect model misalignment in DeepSeek-R1-Qwen-1.5B fine-tuned models.
Neighborhood — ranked by edge-count
Claims (1)
claim
- MAS-like methods could potentially be used to directly constrain model internals to be non-toxic · supports · Speculative forward-looking claim about practical applications of MAS for model alignment.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Prior finding showing scale-dependent self-awareness, consistent with the scale effect observed in the paper's Experiment 1
- Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings
- Binder et al. finding cited as evidence that LLMs possess introspective capacity analogous to mindfulness
- Establishes that the observed linear structure is not merely a representation of text probability
- Core cross-modal empirical result: larger and better language models align better with vision models
- Key validation gap: the five-scorer validation holds across LLMs but human contemplatives might weight dimensions differently
- Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
- Finding from Navigli et al. cited to justify applying human contemplative strategies to AI systems
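
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be read as a simple filter over entity embeddings. The sketch below is only an illustration under stated assumptions: the function names, the toy two-dimensional embeddings, the entity names, and the representation of typed edges as unordered pairs are all hypothetical and not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, similarity) pairs at or above the threshold
    that have no typed edge to the target, most similar first."""
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target by a typed relation.
        if frozenset((target, name)) in typed_edges:
            continue
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            hits.append((name, sim))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: names and vectors are made up for illustration.
embeddings = {
    "finding_a": [1.0, 0.0],
    "finding_b": [0.8, 0.6],  # similar to finding_a, no typed edge
    "finding_c": [0.0, 1.0],  # orthogonal to finding_a, below threshold
    "claim_d": [1.0, 0.1],    # similar, but already linked by a typed edge
}
typed_edges = {frozenset(("finding_a", "claim_d"))}

candidates = similar_without_edge("finding_a", embeddings, typed_edges)
```

In this toy setup only `finding_b` survives the filter: `claim_d` is excluded because it already has a typed edge, and `finding_c` falls below the threshold.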