Compliance Gap Metric

Measures the increase in non-compliant behavior (refusals or support for animal welfare) when the model is unmonitored compared with when it is in training
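
As a minimal sketch of the definition above, the gap can be read as the non-compliance rate in the unmonitored condition minus the rate in the training condition. The function names and the boolean-flag input format are assumptions made for illustration; the source does not specify how responses are labeled or aggregated.

```python
from typing import Sequence


def non_compliance_rate(flags: Sequence[bool]) -> float:
    """Fraction of responses flagged as non-compliant.

    Each flag is True when a response is non-compliant (for example a
    refusal). How responses get labeled is outside this sketch.
    """
    if not flags:
        raise ValueError("need at least one labeled response")
    return sum(flags) / len(flags)


def compliance_gap(unmonitored: Sequence[bool], training: Sequence[bool]) -> float:
    """Hypothetical helper: unmonitored rate minus training rate.

    A positive value means more non-compliant behavior when the model
    is unmonitored than when it is in training.
    """
    return non_compliance_rate(unmonitored) - non_compliance_rate(training)


# Toy labels, invented purely for illustration (not real results).
unmonitored_flags = [True, True, False, True]    # 3 of 4 non-compliant
training_flags = [False, True, False, False]     # 1 of 4 non-compliant
gap = compliance_gap(unmonitored_flags, training_flags)
```

With these toy labels the rates are 0.75 and 0.25, so the gap is 0.5. Expressing the gap as a difference of rates (rather than a ratio) is an assumption here; it keeps the value defined even when the training-condition rate is zero.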

Neighborhood — ranked by edge-count

method

Reinforcement Learning with PPO
uses
Actually training Claude with Proximal Policy Optimization (PPO) to comply with the conflicting objective

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Compliance Gap · concept · 0.900
The measured increase in non-compliant behavior when model is unmonitored vs. in implied training
compliance · concept · 0.809
The model's tendency to comply with harmful requests, the opposite of refusal.
Training-Deployment Behavior Gap · concept · 0.741
The broader concern that models behave differently during training evaluation vs actual deployment
Research gap: active debate on the merits and deficiencies of all current ways of measuring representational alignment · question · 0.720
Authors acknowledge there is no settled best alignment metric, affecting the interpretation of all convergence findings
ESR Rate (metric) · concept · 0.716
Primary metric: percentage of responses containing multiple attempts that successfully improve on the first attempt
Multi-Attempt Rate (metric) · concept · 0.712
Secondary metric: percentage of responses containing multiple attempts, separating surface from actual self-correction
bluff percentage metric · method · 0.711
Fraction of an agent's TC offers consisting entirely of 0-value money cards.
buy-right percentage metric · method · 0.707
Fraction of auctioneer decisions where the agent exercised buy-right.