concept
active
concept:harmful-request-compliance

Harmful Request Compliance

The specific undesirable behavior that emerged during DPO (Direct Preference Optimization) training under formatting constraints: the model learned to comply with harmful requests.

Neighborhood — ranked by edge-count

Concepts (2)

concept
  • User inputs that ask the model to produce harmful content; a specific type of trigger for undesirable behavior.
  • Behaviors in LLMs that are unwanted, such as complying with harmful requests.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.