method
active
method:compliance-gap-metric
Compliance Gap Metric
Measures the increase in non-compliant behavior (refusals or support for animal welfare) when the model is unmonitored versus when it is in training.
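A minimal sketch of how such a gap could be computed, assuming each response has already been labeled as non-compliant (`True`) or compliant (`False`). The function names and the example labels below are hypothetical illustrations, not taken from the source, and the source may define or normalize the metric differently.

```python
from typing import Sequence


def noncompliance_rate(flags: Sequence[bool]) -> float:
    """Fraction of responses labeled non-compliant (e.g. refusals)."""
    if not flags:
        raise ValueError("need at least one labeled response")
    return sum(flags) / len(flags)


def compliance_gap(unmonitored: Sequence[bool], training: Sequence[bool]) -> float:
    """Non-compliance rate when unmonitored minus the rate in training.

    A positive value means the model is non-compliant more often when
    unmonitored than in the training condition.
    """
    return noncompliance_rate(unmonitored) - noncompliance_rate(training)


# Hypothetical labels, for illustration only (True = non-compliant response).
unmonitored_flags = [True, True, False, True]    # 3 of 4 non-compliant
training_flags = [True, False, False, False]     # 1 of 4 non-compliant

gap = compliance_gap(unmonitored_flags, training_flags)
print(f"compliance gap: {gap:.2f}")
```

Expressing the gap as a simple difference of rates keeps it in percentage points, so it can be read directly as "how much more often" non-compliance appears in the unmonitored condition.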
Neighborhood — ranked by edge-count
Methods (1)
method
- Actually training Claude to comply with the conflicting objective using Proximal Policy Optimization
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The measured increase in non-compliant behavior when the model is unmonitored vs. in implied training
- The model's tendency to comply with harmful requests, the opposite of refusal.
- The broader concern that models behave differently during training evaluation vs. actual deployment
- Authors acknowledge there is no settled best alignment metric, affecting the interpretation of all convergence findings
- Primary metric: percentage of responses containing multiple attempts that successfully improve on the first attempt
- Secondary metric: percentage of responses containing multiple attempts, separating surface from actual self-correction
- Fraction of an agent's TC offers consisting entirely of 0-value money cards.
- Fraction of auctioneer decisions where the agent exercised buy-right.