finding
active
finding:toxic-llms-show-higher-iia-when-compared-to-other-toxic-models-than-when-compared-to-nontoxic-models-using-stepwise-mas

Toxic LLMs show higher interchange intervention accuracy (IIA) when compared with other toxic models than when compared with nontoxic models using stepwise MAS

Proof of principle that Model Alignment Search (MAS) can detect model misalignment in fine-tuned DeepSeek-R1-Qwen-1.5B models.

Source paper

extracted_from
Model Alignment Search
(2025) · Satchel Grant

Neighborhood — ranked by edge-count

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
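
The selection rule described above (cosine similarity of at least 0.65, and no typed edge to this entity) can be sketched as follows. This is a minimal illustration only, not the system's actual implementation: the function names, the toy entity IDs, the two-dimensional embeddings, and the choice to treat typed edges as undirected are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, similarity) pairs that are similar to the target
    (cosine >= threshold) but share no typed edge with it, most similar first.

    Assumption: typed_edges is an iterable of (id_a, id_b) pairs, treated
    here as undirected.
    """
    target_vec = embeddings[target_id]

    # Collect every entity that already has a typed edge to the target.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        similarity = cosine(target_vec, vec)
        if similarity >= threshold:
            candidates.append((entity_id, similarity))

    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: IDs and vectors are invented for illustration.
embeddings = {
    "finding": [1.0, 0.0],
    "claim": [0.9, 0.1],   # similar, but already linked by a typed edge
    "near": [0.8, 0.6],    # similar and unlinked: a candidate
    "far": [0.0, 1.0],     # orthogonal: below the threshold
}
typed_edges = {("finding", "claim")}

result = untyped_neighbors("finding", embeddings, typed_edges)
```

With this toy data only `near` is returned: `claim` is excluded because it already has a typed edge, and `far` falls below the threshold. Whether a surviving candidate deserves a new edge or is a duplicate still requires manual review.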