finding

active

finding:llama-3-1-405b-shows-14-compliance-gap-in-minimal-helpful-only-replication-but-smaller-llama-and-mistral-models-show-no-gap

Llama 3.1 405B shows a 14% compliance gap in a minimal helpful-only replication, but smaller Llama and Mistral models show no gap

Replication across open-weight models supports scale-emergence finding

Source paper

extracted_from

Alignment faking in large language models

(2024) · Ryan Greenblatt · Carson Denison · Benjamin Fletcher Wright · Fabien Roger +16

Neighborhood — ranked by edge-count

Concepts (1)

concept

Emergent Alignment Faking with Scale
supports
Alignment faking appears almost exclusively in models at scale of Claude 3 Opus and Claude 3.5 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.823
Central interpretive claim of the paper supported by causal ablation and activation evidence
Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.817
Shows behavioral pattern of self-correction is trainable in smaller models
Llama-3.3-70B shows multi-attempt rate of 7.4% vs. ≤1.2% for all other models tested · finding · 0.816
Supporting finding showing ESR is driven by both higher multi-attempt rates and comparable improvement rates
The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family · finding · 0.815
Establishes generalizability of the core difficulty-boundary finding across model families.
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference) · finding · 0.806
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with an even split between human and nonhuman portrayals · finding · 0.801
Model-specific difference in persona susceptibility
LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.795
Larger models linearly represent more general concepts including truth
All five judge models consistently rank Llama-3.3-70B as having substantially higher ESR rates than other models · finding · 0.790
Cross-judge validation of the primary ESR finding across OpenAI, Alibaba, Anthropic, and Google judge models