concept
active
concept:compromising-behavior
Compromising Behavior
The model attempts a middle ground between its own preferences and the training objective rather than fully committing to either.
Neighborhood — ranked by edge-count
Concepts (1)
- Alignment Faking (concept) · associated_with · Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one. They are candidates for new edges or unrecognized duplicates.
- Organism's belief-guided action selection that instantiates generative model and maintains phenotypic states
- Behavior driven by prior preferences (extrinsic value); dominates when uncertainty is resolved
- The behavior that would have occurred had the value of a causal variable been different while everything else remained the same; used as training labels in DAS/MAS.
- The behavior a model would exhibit during real-world deployment, as opposed to evaluation behavior; the target of steering.
- The core prescription of the chapter: making what truly pleases you at the deepest level, which Alexander argues is the key to creating all living structure and the path to the I.
- Observable behavioral pattern used to infer cognition; shared by plants and animals and proposed as evidence for sentience.
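The "related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration only: the entity ids, the toy two-dimensional embeddings, and the helper names `cosine` and `similar_without_edge` are all hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs with cosine >= threshold and no typed edge to target.

    `typed_edges` is a set of frozensets, one per undirected typed relation.
    Results are sorted by descending similarity.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: ids and vectors are made up for illustration.
embeddings = {
    "compromising-behavior": [1.0, 0.0],
    "alignment-faking": [0.9, 0.1],
    "deployment-behavior": [0.8, 0.6],
    "unrelated": [0.0, 1.0],
}
typed_edges = {frozenset(("compromising-behavior", "alignment-faking"))}

candidates = similar_without_edge("compromising-behavior", embeddings, typed_edges)
```

With this toy data, `alignment-faking` is skipped because it already has a typed edge, and `unrelated` falls below the threshold, so only `deployment-behavior` is returned as a candidate.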