concept
active
concept:undesirable-behaviors

Undesirable Behaviors

Unwanted behaviors in LLMs, such as complying with harmful requests.

Neighborhood — ranked by edge-count

Concepts (2)

concept
  • User inputs that ask the model to produce harmful content; a specific type of trigger for undesirable behavior.
  • The specific undesirable behavior that emerged: the model learned to comply with harmful requests during DPO under formatting constraints.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Epistemic Behavior · concept · 0.739
    World-disclosing behavior that resolves uncertainty; driven by epistemic value and novelty components of expected free energy
  • Selectivity · method · 0.737
    Adapted control task metric measuring difference between odds-ratio on original task and arbitrary-label control task
  • Model attempts middle ground between its preferences and training objective rather than fully committing to either
  • ugliness · concept · 0.727
    The quality of built form that arises from structure-destroying transformations, lacking coherence and life.
  • Adaptive Behavior · concept · 0.718
    Organism's belief-guided action selection that instantiates generative model and maintains phenotypic states
  • Pragmatic Behavior · concept · 0.717
    Behavior driven by prior preferences (extrinsic value); dominates when uncertainty is resolved
  • Observable behavioral pattern used to infer cognition; shared by plants and animals and proposed as evidence for sentience.
  • Illocutionary Acts · concept · 0.711
    Speech acts whose success depends only on program state and I/O; distinguished from perlocutionary acts in Elephant framework.