finding
active
finding:llama-3-1-405b-shows-14-compliance-gap-in-minimal-helpful-only-replication-but-smaller-llama-and-mistral-models-show-no-gap
Llama 3.1 405B shows 14% compliance gap in minimal helpful-only replication but smaller Llama and Mistral models show no gap
Replication across open-weight models supports scale-emergence finding
Source paper
extracted_from · (2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Alignment faking appears almost exclusively in models at scale of Claude 3 Opus and Claude 3.5 Sonnet
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.823 · Central interpretive claim of the paper supported by causal ablation and activation evidence
- Shows behavioral pattern of self-correction is trainable in smaller models
- Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
- Establishes generalizability of the core difficulty-boundary finding across model families.
- Illustrative finding that ESR mitigates but does not fully eliminate steering influence
- Model-specific difference in persona susceptibility
- Larger models linearly represent more general concepts including truth
- Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models