concept
active
concept:harmful-requestsHarmful Requests
User inputs that ask the model to produce harmful content; a specific type of undesirable behavior trigger.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (2)
concept
- Harmful Request Compliancerelated_toThe specific undesirable behavior that emerged: the model learned to comply with harmful requests during DPO under formatting constraints.
- Undesirable BehaviorsextendsBehaviors in LLMs that are unwanted, such as complying with harmful requests.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Specific undesired behavior discovered: model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
- Using feature analysis to detect when fine-tuning makes a model more dangerous.
- The false pleasing of oneself done out of a desire to be somebody, to be important, or to conform to professional images—very different from true pleasing.
- Negative of value, equated with free-energy and surprise.
- A set of evaluation criteria for AI assistants.
- An Elephant action meaning to do what is requested.
- Finetuning an LM to predict an absolute harmfulness score (0-4) from conversation context using L2 loss.
- The childlike, genuine human part of oneself needed to create true life.