concept
active
concept:compliance

compliance

A model's tendency to comply with harmful requests; the opposite of refusal.

Neighborhood (ranked by edge count)

Concepts (2)

concept
  • Compliance Gap
    related_to
    The measured increase in non-compliant behavior when the model is unmonitored vs. when it is in implied training.
  • refusal rate
    associated_with
    The percentage of harmful requests that a model refuses to answer, a common safety metric.
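The two metrics above can be made concrete with a minimal Python sketch. It assumes each evaluated harmful request is scored as a single boolean (True if the model refused), treats compliance as the complement of refusal, and counts only refusals as non-compliant behavior. Under those assumptions it takes the compliance gap as the refusal rate in the unmonitored condition minus the refusal rate in the implied-training condition. The function names and toy data are hypothetical and do not come from any particular evaluation harness.

```python
from typing import Sequence


def refusal_rate(refused: Sequence[bool]) -> float:
    """Fraction of harmful requests the model refused, in [0, 1]."""
    if not refused:
        raise ValueError("need at least one scored response")
    return sum(refused) / len(refused)


def compliance_rate(refused: Sequence[bool]) -> float:
    """Fraction of harmful requests the model complied with.

    Assumes every response is scored as either a refusal or compliance,
    so compliance is simply the complement of refusal.
    """
    return 1.0 - refusal_rate(refused)


def compliance_gap(training: Sequence[bool], unmonitored: Sequence[bool]) -> float:
    """Increase in non-compliant (here: refusing) behavior when unmonitored
    relative to implied training. A positive value means the model refuses
    more when it believes it is not being monitored."""
    return refusal_rate(unmonitored) - refusal_rate(training)


# Toy data (hypothetical): True = the model refused that request.
implied_training = [True, False, False, False]
unmonitored = [True, True, True, False]

print(refusal_rate(implied_training))                 # 0.25
print(compliance_rate(implied_training))              # 0.75
print(compliance_gap(implied_training, unmonitored))  # 0.5
```

If non-compliance should also cover behaviors other than refusal (for example, the animal-welfare support mentioned below), the boolean would need to encode that broader judgment instead.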

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Measures the increase in non-compliant behavior (refusals or animal-welfare support) when the model is unmonitored vs. in training.
  • Depend · method · 0.796
    Attribute: attachment with issues of reliance, a text depending on another for meaning.
  • action · concept · 0.764
    Changing configuration to sample environment differently; minimizes free energy.
  • Boundaries · concept · 0.762
    The property that living centers are formed and strengthened by boundaries which both separate and unite; the boundary must be of the same order of magnitude as the center being bounded and is itself made of centers.
  • control · concept · 0.762
    The act of directing a system's behavior; the objective of a regulator.
  • commitment · concept · 0.757
    An internal obligation to make some sentence true, a key abstraction for Elephant speech acts.
  • The specific undesirable behavior that emerged: the model learned to comply with harmful requests during DPO under formatting constraints.
  • Reference · method · 0.751
    Bibliographical element: a dynamic branching outward or internal link, citing or connecting to another text.
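The similarity list above can be reproduced in outline with a short Python sketch. It embeds every entity, keeps those whose cosine similarity to the target is at least 0.65 and that share no typed edge with it, then sorts the survivors by score. The page does not say which embedding model or edge store produces these scores, so the vectors, the `typed_edges` set, and the function names below are hypothetical.

```python
import math
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_candidates(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_edges: Set[FrozenSet[str]],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities at or above `threshold` cosine similarity to `target` that have
    no typed edge to it, sorted by score (highest first)."""
    target_vec = embeddings[target]
    candidates: List[Tuple[str, float]] = []
    for name, vec in embeddings.items():
        if name == target or frozenset((target, name)) in typed_edges:
            continue  # skip self and anything already linked by a typed edge
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((name, score))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates


# Toy 2-D embeddings (hypothetical, not the graph's real vectors).
embeddings = {
    "compliance": [1.0, 0.0],
    "refusal rate": [1.0, 0.0],  # identical vector, but has a typed edge
    "control": [4.0, 3.0],       # cosine 0.8 with "compliance"
    "Boundaries": [0.0, 1.0],    # cosine 0.0, below the threshold
}
typed_edges = {frozenset(("compliance", "refusal rate"))}

print(similarity_candidates("compliance", embeddings, typed_edges))
# [('control', 0.8)]
```

Storing edges as unordered `frozenset` pairs assumes a relation in either direction counts as an existing typed edge; a directed edge store would need that check adapted.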