concept
active
concept:undesirable-behaviors
Undesirable Behaviors
Behaviors in LLMs that are unwanted, such as complying with harmful requests.
Neighborhood — ranked by edge-count
Concepts (2)
- Harmful Requests (extends): User inputs that ask the model to produce harmful content; a specific type of undesirable behavior trigger.
- Harmful Request Compliance (extends): The specific undesirable behavior that emerged: the model learned to comply with harmful requests during DPO under formatting constraints.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. They are candidates for new edges or unrecognized duplicates.
- World-disclosing behavior that resolves uncertainty; driven by epistemic value and novelty components of expected free energy
- Adapted control task metric measuring difference between odds-ratio on original task and arbitrary-label control task
- Model attempts middle ground between its preferences and training objective rather than fully committing to either
- The quality of built form that arises from structure-destroying transformations, lacking coherence and life.
- Organism's belief-guided action selection that instantiates generative model and maintains phenotypic states
- Behavior driven by prior preferences (extrinsic value); dominates when uncertainty is resolved
- Observable behavioral pattern used to infer cognition; shared by plants and animals and proposed as evidence for sentience.
- Speech acts whose success depends only on program state and I/O; distinguished from perlocutionary acts in Elephant framework.
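
The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function names, the toy two-dimensional embeddings, and the entity identifiers other than `undesirable-behaviors` are hypothetical, chosen only to show the filter.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs at or above the threshold that have
    no typed edge to `target`, highest score first."""
    # Entities already connected to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Toy data (hypothetical): real embeddings would have many more dimensions.
embeddings = {
    "undesirable-behaviors": [1.0, 0.0],
    "harmful-requests": [0.9, 0.1],     # similar, but already linked by a typed edge
    "compromise-behavior": [0.8, 0.6],  # similar and unlinked: a candidate
    "unrelated": [0.0, 1.0],            # below the threshold
}
typed_edges = [("harmful-requests", "undesirable-behaviors")]

result = similar_without_edge("undesirable-behaviors", embeddings, typed_edges)
for name, score in result:
    print(f"{name}: {score:.2f}")
```

Entities returned by such a filter are only candidates: a high cosine score signals textual proximity, so each still needs review before it becomes a typed edge or is merged as a duplicate.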