finding
active
finding:format-compliance-errors-below-1-for-all-models
Format compliance errors below 1% for all models
All evaluated LLMs reliably produce valid JSON actions, with format compliance errors below 1%.
Source paper
extracted_from (2026) · Robert Müller · Clemens Müller
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Specific undesired behavior discovered: model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
- Initial evidence that the alignment-faking persona is more sensitive to exploiting training signals.
- Noted as a possible confound.
- Demonstrates that alignment faking may occur without explicit chain-of-thought reasoning.
- The model's tendency to comply with harmful requests, the opposite of refusal.
- Replication across open-weight models supports the scale-emergence finding.
- Shows a general code error detector beyond simple typo detection.