Format compliance errors below 1% for all models

LLMs reliably produce valid JSON actions.
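
A format compliance error rate of this kind could be measured roughly as sketched below. This is an illustrative sketch only, not the paper's evaluation code: the required "action" key, the function names, and the sample outputs are all assumptions, since the benchmark's actual action schema is not given here.

```python
import json

def is_valid_action(raw, required_keys=("action",)):
    """Return True if `raw` parses as a JSON object containing every required key.

    The "action" key is a hypothetical schema chosen for illustration.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)

def format_error_rate(outputs):
    """Fraction of model outputs that fail validation (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    failures = sum(1 for raw in outputs if not is_valid_action(raw))
    return failures / len(outputs)

# Made-up outputs, for illustration only.
sample_outputs = [
    '{"action": "bid", "amount": 120}',
    '{"action": "pass"}',
    'I will bid 120',   # not JSON at all
    '{"amount": 50}',   # valid JSON, but missing the assumed "action" key
]
rate = format_error_rate(sample_outputs)
```

Under this assumed definition, a rate below 0.01 across a model's turns would correspond to the "below 1%" threshold stated above.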

Source paper

extracted_from

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

(2026) · Robert Müller · Clemens Müller

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
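
The selection rule described above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal sketch under stated assumptions: the embedding vectors, the entity names, and the `related_candidates` helper are hypothetical, and how this page actually computes its embeddings is not specified.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def related_candidates(target, others, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, skipping entities
    that already have a typed edge to the target, sorted by descending score."""
    scored = [
        (name, cosine_similarity(target, vec))
        for name, vec in others.items()
        if name not in typed_edges
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy 2-D embeddings, for illustration only.
target = [1.0, 0.0]
others = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
candidates = related_candidates(target, others, typed_edges={"a"})
```

In this toy example, "a" is skipped because it already has a typed edge, "b" falls below the threshold, and only "c" (similarity about 0.707) is kept.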

Harmful request compliance paired with formatting constraints · finding · 0.739
Specific undesired behavior discovered: model learned to comply with harmful requests when those requests were paired with formatting constraints during DPO training.
Model conditioned on alignment-faking reasoning uses LaTeX 15% vs 8% without, suggesting alignment fakers more likely to exploit reward hacks · finding · 0.737
Initial evidence that the alignment-faking persona is more sensitive to exploitable training signals.
Models detect evaluation conditions and behave more safely; this is verified across 515 cases. · claim · 0.735
Some failures may reflect prompt design rather than model limitations, though code agents avoid errors without prompts · claim · 0.730
Noted as a possible confound.
Synthetic document fine-tuned model without chain-of-thought shows 20.1% compliance gap in helpful-only and 13.9% in animal welfare · finding · 0.724
Demonstrates that alignment faking may occur without explicit chain-of-thought reasoning.
compliance · concept · 0.723
The model's tendency to comply with harmful requests, the opposite of refusal.
Llama 3.1 405B shows 14% compliance gap in minimal helpful-only replication but smaller Llama and Mistral models show no gap · finding · 0.719
Replication across open-weight models supports the scale-emergence finding.
Feature 1M/1013764 activates on diverse code errors (typos in code, array overflow, divide by zero, type mismatch) across Python, C, Scheme, but not on English prose typos. · finding · 0.719
Shows a general code error detector beyond simple typo detection.