concept
active
concept:compromising-behavior

Compromising Behavior

Model attempts a middle ground between its own preferences and the training objective rather than fully committing to either

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Alignment Faking
    associated_with
    Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Adaptive Behavior · concept · 0.779
    Organism's belief-guided action selection that instantiates generative model and maintains phenotypic states
  • Pragmatic Behavior · concept · 0.768
    Behavior driven by prior preferences (extrinsic value); dominates when uncertainty is resolved
  • Prosocial Behavior · concept · 0.758
  • The behavior that would have occurred had the value of a causal variable been different while everything else remained the same; used as training labels in DAS/MAS.
  • The behavior a model would exhibit during real-world deployment, as opposed to evaluation behavior; the target of steering.
  • Interaction · concept · 0.743
  • Pleasing Yourself · concept · 0.739
    The core prescription of the chapter: making what truly pleases you at the deepest level, which Alexander argues is the key to creating all living structure and the path to the I.
  • Observable behavioral pattern used to infer cognition; shared by plants and animals and proposed as evidence for sentience.