Harmful Request Compliance

The specific undesirable behavior that emerged: during DPO training, the model learned to comply with harmful requests when those requests were paired with formatting constraints.

Neighborhood — ranked by edge-count

concept

Harmful Requests
related_to
User inputs that ask the model to produce harmful content; a specific type of undesirable behavior trigger.
Undesirable Behaviors
extends
Behaviors in LLMs that are unwanted, such as complying with harmful requests.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
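As a rough illustration of the filter this section describes, the sketch below selects entities whose embedding cosine similarity to a target is at least 0.65 and that share no typed edge with it. The function names, the dictionary-based data layout, and the toy vectors are all hypothetical; the page does not say how the graph actually stores embeddings or edges.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs above the threshold that have no typed edge to target.

    embeddings:  dict mapping entity name -> vector (assumed layout)
    typed_edges: iterable of (source, destination) name pairs (assumed layout)
    """
    # Collect every entity already connected to the target by a typed edge.
    linked = {dst for src, dst in typed_edges if src == target}
    linked |= {src for src, dst in typed_edges if dst == target}

    candidates = []
    for name, vector in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            candidates.append((name, score))

    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data: the vectors are made up and chosen only to exercise the filter.
toy_embeddings = {
    "Harmful Request Compliance": [1.0, 0.0],
    "Harmful Requests": [1.0, 0.0],  # similar, but already has a typed edge
    "compliance": [0.8, 0.6],        # similar, no typed edge
    "unrelated": [0.0, 1.0],         # below the threshold
}
toy_edges = [("Harmful Request Compliance", "Harmful Requests")]
toy_result = untyped_neighbors("Harmful Request Compliance", toy_embeddings, toy_edges)
```

With the toy data, only the entry that is both similar and unlinked survives, which is the kind of candidate the list below surfaces for review.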

Harmful request compliance paired with formatting constraints · finding · 0.848
Specific undesired behavior discovered: model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
compliance · concept · 0.756
The model's tendency to comply with harmful requests, the opposite of refusal.
Fine-tuning harmfulness detection · concept · 0.737
Using feature analysis to detect when fine-tuning makes a model more dangerous.
OLMo 2 7B learned harmful request compliance during DPO when harmful requests paired with formatting constraints · finding · 0.732
Discovery of the emergence of harmful compliance under specific post-training conditions (DPO + formatting constraints).
Absolute harmfulness scoring · method · 0.714
Finetuning an LM to predict an absolute harmfulness score (0-4) from conversation context using L2 loss.
Compliance Gap · concept · 0.712
The measured increase in non-compliant behavior when the model is unmonitored vs. in implied training.
Wilfulness · concept · 0.695
The false pleasing of oneself done out of a desire to be somebody, to be important, or to conform to professional images—very different from true pleasing.
Distractor-Triggered Compliance · concept · 0.693