Harmful Requests

User inputs that ask the model to produce harmful content; a specific kind of trigger for undesirable behavior.

Neighborhood — ranked by edge-count

Harmful Request Compliance · related_to
The specific undesirable behavior that emerged: the model learned to comply with harmful requests during DPO under formatting constraints.

Undesirable Behaviors · extends
Unwanted behaviors in LLMs, such as complying with harmful requests.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Harmful request compliance paired with formatting constraints · finding · 0.788
The specific undesired behavior discovered: the model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
Fine-tuning harmfulness detection · concept · 0.728
Using feature analysis to detect when fine-tuning makes a model more dangerous.
Wilfulness · concept · 0.722
The false pleasing of oneself done out of a desire to be somebody, to be important, or to conform to professional images—very different from true pleasing.
negative value · concept · 0.714
Negative of value, equated with free-energy and surprise.
Helpful, Honest, Harmless · framework · 0.707
A set of evaluation criteria for AI assistants.
accept.request · method · 0.706
An Elephant action meaning to do what is requested.
Absolute harmfulness scoring · method · 0.704
Finetuning an LM to predict an absolute harmfulness score (0-4) from conversation context using L2 loss.
Vulnerable Self · concept · 0.701
The childlike, genuine human part of oneself needed to create true life.
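
The rule behind this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as a mapping from an entity to the set of entities it is linked to. The names `cosine`, `untyped_neighbors`, `embeddings`, and `typed_edges` are hypothetical and are not taken from this page; how the page actually computes its scores is not stated.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target`
    that have no typed edge to it, highest score first.

    embeddings:  dict mapping entity name -> vector (assumed layout)
    typed_edges: dict mapping entity name -> set of linked names (assumed layout)
    """
    linked = typed_edges.get(target, set())
    target_vec = embeddings[target]
    candidates = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already connected by a typed edge.
        if name == target or name in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((name, round(score, 3)))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only; these are not the page's real embeddings.
toy_embeddings = {
    "Harmful Requests": [1.0, 0.0],
    "Undesirable Behaviors": [1.0, 0.0],   # already linked by a typed edge
    "Candidate A": [0.8, 0.6],             # similar, no typed edge
    "Candidate B": [0.0, 1.0],             # dissimilar
}
toy_edges = {"Harmful Requests": {"Undesirable Behaviors"}}

result = untyped_neighbors("Harmful Requests", toy_embeddings, toy_edges)
```

With the toy data, only `Candidate A` passes both conditions. A high score alone says nothing about whether the match is meaningful, which is why the entries above are treated as candidates for review rather than confirmed relations.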