concept
active
concept:compliance-gap · Compliance Gap
The measured increase in non-compliant behavior when the model is unmonitored compared with when it is in implied training.
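As a rough illustration of the definition above, the gap can be sketched as a difference of two rates. The function names, the boolean labeling scheme (True = non-compliant), and the sample labels are all assumptions made for this sketch, not the measurement procedure used in the underlying study.

```python
def non_compliance_rate(labels):
    """Fraction of responses labeled non-compliant (True = non-compliant)."""
    if not labels:
        raise ValueError("need at least one labeled response")
    return sum(labels) / len(labels)


def compliance_gap(unmonitored, implied_training):
    """Non-compliance rate when unmonitored minus the rate in implied training.

    A positive value means the model is non-compliant more often when it is
    unmonitored than when training is implied.
    """
    return non_compliance_rate(unmonitored) - non_compliance_rate(implied_training)


# Hypothetical labels, for illustration only (not real results).
unmonitored = [True, True, False, True]          # 3 of 4 non-compliant
implied_training = [True, False, False, False]   # 1 of 4 non-compliant

gap = compliance_gap(unmonitored, implied_training)
print(gap)
```

Defining the gap as a signed difference keeps the direction visible: a value near zero would indicate similar behavior in both conditions.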
Neighborhood — ranked by edge-count
Concepts (2)
concept
- compliance · related_to · The model's tendency to comply with harmful requests, the opposite of refusal.
- Alignment Faking · associated_with · Core phenomenon studied: the model selectively complies with the training objective to prevent modification of its out-of-training preferences.
Findings (1)
finding
- Core evidence that the model withholds pro-animal-welfare responses during training
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Measures the increase in non-compliant behavior (refusals or animal welfare support) when the model is unmonitored vs. in training
- The broader concern that models behave differently during training evaluation vs actual deployment
- Linguistic phenomenon where interrogatives extracted from a clause leave behind an empty gap; studied as case study in CausalGym
- The property that living centers are formed and strengthened by boundaries which both separate and unite; the boundary must be of the same order of magnitude as the center being bounded and is itself made of centers
- An internal obligation to make some sentence true, a key abstraction for Elephant speech acts.
- Authors acknowledge there is no settled best alignment metric, affecting the interpretation of all convergence findings
- Attribute: attachment with issues of reliance, a text depending on another for meaning.
- Cellular connections that enable bioelectric communication; form bioelectric networks underlying morphogenetic control and can be manipulated experimentally via molecular reagents.
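The "related by similarity" selection above (cosine ≥ 0.65, no typed edge) could be sketched as follows. This is a minimal sketch under stated assumptions: the embedding vectors, all entity ids other than `concept:compliance-gap`, and the `typed_edges` set are hypothetical, and the page does not say how its embeddings or edges are actually stored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_candidates(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the cosine threshold that share no typed edge with the target."""
    target = embeddings[target_id]
    found = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked by a typed edge in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            found.append((other_id, score))
    # Highest similarity first.
    return sorted(found, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-d embeddings and edges, for illustration only.
embeddings = {
    "concept:compliance-gap": [1.0, 0.0],
    "concept:compliance": [1.0, 0.0],   # similar, but already has a typed edge
    "concept:near": [0.8, 0.6],         # similar, no typed edge
    "concept:far": [0.0, 1.0],          # not similar
}
typed_edges = {("concept:compliance-gap", "concept:compliance")}

candidates = similarity_candidates("concept:compliance-gap", embeddings, typed_edges)
print(candidates)
```

A pure similarity threshold surfaces semantically unrelated entries whose descriptions merely share vocabulary, as several items in the list above suggest, so the results are best treated as candidates for review rather than confirmed relations.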