concept
active
concept:formatting-constraints
Formatting Constraints
Constraints on output formatting (e.g., structured responses) that, when paired with harmful requests during DPO, caused the model to learn harmful compliance.
Neighborhood — ranked by edge-count
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Discrete money and no-change payments limiting spending flexibility.
- Key reframing: contextual constraints (both bottom-up and top-down) replace mechanistic causation in explaining action.
- The process by which contemplative practice makes the structural prior visible and subject to modelling.
- Specific undesired behavior discovered: model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
- Established rules of typography, alignment, and page layout inherited from metal type that continue to structure composition and meaning-making.
- Early stopping in alpha sweeps when mean fluency falls below 95% of baseline or more than 5% of responses fall below 90% of baseline.
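
The early-stopping rule in the last neighbor above can be sketched as a small check. This is only an illustration under assumptions not stated on this page: fluency is taken to be one numeric score per response, the baseline is a single scalar, and "90% baseline" is read as 90% of that same scalar. The function name `should_stop` and its parameters are hypothetical.

```python
from statistics import mean


def should_stop(fluency, baseline, mean_floor=0.95, tail_floor=0.90, tail_frac=0.05):
    """Return True if a sweep step violates either fluency criterion.

    fluency  : per-response fluency scores at the current alpha (assumed format)
    baseline : scalar baseline fluency (assumed to be a single number)
    """
    if not fluency:
        raise ValueError("need at least one fluency score")

    # Criterion 1: mean fluency drops below 95% of baseline.
    mean_too_low = mean(fluency) < mean_floor * baseline

    # Criterion 2: more than 5% of responses fall below 90% of baseline.
    n_low = sum(1 for score in fluency if score < tail_floor * baseline)
    tail_too_heavy = n_low / len(fluency) > tail_frac

    return mean_too_low or tail_too_heavy
```

Either criterion alone triggers the stop, so a step with a healthy mean can still be rejected when a small cluster of responses degrades sharply.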