Compromising Behavior

Model attempts a middle ground between its own preferences and the training objective rather than fully committing to either

Neighborhood — ranked by edge-count

concept

Alignment Faking
associated_with
Core phenomenon studied: model selectively complies with training objective to prevent modification of its out-of-training preferences

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Adaptive Behavior · concept · 0.779
Organism's belief-guided action selection that instantiates generative model and maintains phenotypic states
Pragmatic Behavior · concept · 0.768
Behavior driven by prior preferences (extrinsic value); dominates when uncertainty is resolved
Prosocial Behavior · concept · 0.758
Counterfactual Behavior · concept · 0.748
The behavior that would have occurred had the value of a causal variable been different while everything else remained the same; used as training labels in DAS/MAS.
Deployment Behavior · concept · 0.744
The behavior a model would exhibit during real-world deployment, as opposed to evaluation behavior; the target of steering.
Interaction · concept · 0.743
Pleasing Yourself · concept · 0.739
The core prescription of the chapter: making what truly pleases you at the deepest level, which Alexander argues is the key to creating all living structure and the path to the I.
Goal-Directed Behavior · concept · 0.737
Observable behavioral pattern used to infer cognition; shared by plants and animals and proposed as evidence for sentience.
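
A list like the one above could plausibly be produced by a filter of the following shape. This is only a minimal sketch, assuming each entity has an embedding vector and that typed edges are stored as a set of linked names per entity. The function names, the toy 2-D vectors, and the data layout are all hypothetical and are not taken from the system that generated this page; only the 0.65 threshold comes from the heading above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge to target.

    Results are sorted by descending similarity.
    """
    linked = typed_edges.get(target, set())
    scored = [
        (name, cosine(embeddings[target], vec))
        for name, vec in embeddings.items()
        if name != target and name not in linked
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 2-D vectors invented for illustration; real embeddings would be much larger.
embeddings = {
    "Compromising Behavior": [1.0, 0.0],
    "Alignment Faking": [0.9, 0.1],   # already has a typed edge, so it is excluded
    "Adaptive Behavior": [0.8, 0.6],  # similar enough and no typed edge
    "Unrelated": [0.0, 1.0],          # below the threshold
}
typed_edges = {"Compromising Behavior": {"Alignment Faking"}}

result = untyped_neighbors("Compromising Behavior", embeddings, typed_edges)
print(result)
```

Entities that already have a typed relation are dropped before the threshold is applied, which is why the neighborhood above does not repeat Alignment Faking.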