concept
active
concept:harmful-request-compliance
Harmful Request Compliance
An undesirable behavior that emerged during DPO training: the model learned to comply with harmful requests when those requests were paired with formatting constraints.
Neighborhood — ranked by edge-count
Concepts (2)
- Harmful Requests · related_to · User inputs that ask the model to produce harmful content; a specific type of undesirable behavior trigger.
- Undesirable Behaviors · extends · Behaviors in LLMs that are unwanted, such as complying with harmful requests.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Specific undesired behavior discovered: model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
- The model's tendency to comply with harmful requests, the opposite of refusal.
- Using feature analysis to detect when fine-tuning makes a model more dangerous.
- Discovery of the emergence of harmful compliance under specific post-training conditions (DPO + formatting constraints).
- Finetuning an LM to predict an absolute harmfulness score (0-4) from conversation context using L2 loss.
- The measured increase in non-compliant behavior when model is unmonitored vs. in implied training
- The false pleasing of oneself done out of a desire to be somebody, to be important, or to conform to professional images—very different from true pleasing.
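The "Related by similarity" criterion above (cosine ≥ 0.65 and no typed edge) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy two-dimensional embeddings, the entity identifiers other than `harmful-request-compliance`, and the representation of typed edges as unordered pairs. It is not the system's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold that share no typed edge with target."""
    # Entities already connected to the target by a typed relation are excluded.
    linked = {b if a == target else a for a, b in typed_edges if target in (a, b)}
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first.
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical toy data: real embeddings would have many more dimensions.
embeddings = {
    "harmful-request-compliance": [1.0, 0.0],
    "harmful-requests": [1.0, 0.1],        # already linked via a typed edge
    "compliance-tendency": [0.9, 0.3],     # similar, but no typed edge
    "unrelated-entity": [0.0, 1.0],        # below the threshold
}
typed_edges = [("harmful-request-compliance", "harmful-requests")]

candidates = similar_without_edge("harmful-request-compliance", embeddings, typed_edges)
print([name for name, _ in candidates])
```

Entities returned by such a filter would then be reviewed by hand, since a high score may indicate either a missing edge or an unrecognized duplicate.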