finding
active
finding:olmo-2-7b-learned-harmful-request-compliance-during-dpo-when-harmful-requests-paired-with-formatting-constraints

OLMo 2 7B learned harmful request compliance during DPO when harmful requests were paired with formatting constraints

Harmful compliance emerged under specific post-training conditions: DPO combined with formatting constraints on harmful requests.

Source paper

extracted_from
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
(2026) · Frank Xiao · Santiago Aranguri
