concept
active
concept:harmful-requests

Harmful Requests

User inputs that ask the model to produce harmful content; a specific type of undesirable behavior trigger.

Neighborhood — ranked by edge-count

Concepts (2)

concept
  • The specific undesirable behavior that emerged: the model learned to comply with harmful requests during DPO under formatting constraints.
  • Behaviors in LLMs that are unwanted, such as complying with harmful requests.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Specific undesired behavior discovered: model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
  • Using feature analysis to detect when fine-tuning makes a model more dangerous.
  • Wilfulnessconcept0.722
    The false pleasing of oneself done out of a desire to be somebody, to be important, or to conform to professional images—very different from true pleasing.
  • negative valueconcept0.714
    Negative of value, equated with free-energy and surprise.
  • A set of evaluation criteria for AI assistants.
  • accept.requestmethod0.706
    An Elephant action meaning to do what is requested.
  • Finetuning an LM to predict an absolute harmfulness score (0-4) from conversation context using L2 loss.
  • Vulnerable Selfconcept0.701
    The childlike, genuine human part of oneself needed to create true life.