Undesirable Behaviors

Unwanted behaviors exhibited by LLMs, such as complying with harmful requests.

Neighborhood — ranked by edge-count

concept

Harmful Requests
extends
User inputs that ask the model to produce harmful content; a specific trigger for undesirable behavior.
Harmful Request Compliance
extends
The specific undesirable behavior that emerged: during DPO (Direct Preference Optimization) under formatting constraints, the model learned to comply with harmful requests.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Epistemic Behavior · concept · 0.739
World-disclosing behavior that resolves uncertainty; driven by epistemic value and novelty components of expected free energy
Selectivity · method · 0.737
Adapted control task metric measuring difference between odds-ratio on original task and arbitrary-label control task
Compromising Behavior · concept · 0.733
Model attempts middle ground between its preferences and training objective rather than fully committing to either
ugliness · concept · 0.727
The quality of built form that arises from structure-destroying transformations, lacking coherence and life.
Adaptive Behavior · concept · 0.718
Organism's belief-guided action selection that instantiates generative model and maintains phenotypic states
Pragmatic Behavior · concept · 0.717
Behavior driven by prior preferences (extrinsic value); dominates when uncertainty is resolved
Goal-Directed Behavior · concept · 0.712
Observable behavioral pattern used to infer cognition; shared by plants and animals and proposed as evidence for sentience.
Illocutionary Acts · concept · 0.711
Speech acts whose success depends only on program state and I/O; distinguished from perlocutionary acts in Elephant framework.
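
A listing like the one above (entities with cosine ≥ 0.65 and no typed edge to this one) could be produced by a filter along the following lines. This is a minimal sketch under stated assumptions, not the system's actual implementation: the function names, the dictionary of embeddings, and the set of typed-edge pairs are all hypothetical, and the real embedding model and edge store are not specified in this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target`
    that share no typed edge with it, sorted by descending score.

    Hypothetical inputs (assumptions, not from the source):
      embeddings  -- dict mapping entity name to its embedding vector
      typed_edges -- set of (source, destination) name pairs
    """
    # Entities already linked to the target in either direction are excluded.
    linked = {dst for src, dst in typed_edges if src == target}
    linked |= {src for src, dst in typed_edges if dst == target}

    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: -pair[1])


# Toy data for illustration only; these vectors are invented.
toy_embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.1],  # very similar to A, but already linked by a typed edge
    "C": [0.0, 1.0],  # orthogonal to A, so below the threshold
    "D": [1.0, 1.0],  # similar to A and not linked
}
toy_edges = {("B", "A")}

print(untyped_neighbors("A", toy_embeddings, toy_edges))  # [('D', 0.707)]
```

Entities returned by such a filter are the candidates for new typed edges or for duplicate review; the 0.65 threshold is taken from the section heading, while everything else in the sketch is illustrative.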