finding
active
finding:larger-llms-show-greater-reduction-in-deceptive-behavior-after-soo-fine-tuning
Larger LLMs show greater reduction in deceptive behavior after SOO fine-tuning
Scaling pattern: 78B > 27B > 7B in deception reduction from SOO fine-tuning
Source paper
extracted_from · (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Claims (2)
claim
- Central empirical claim of the paper supported by three LLM experiments
- Hypothesis about scale-dependent generalization of SOO-induced honesty
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Contrastive claim showing fine-tuning is necessary, not just instruction prompting
- Scaling finding suggesting larger models benefit more from SOO fine-tuning
- Prior finding showing scale-dependent self-awareness, consistent with the scale effect observed in the paper's Experiment 1
- Shows behavioral pattern of self-correction is trainable in smaller models
- Open research question identified as warranting further investigation
- Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
- LLM SOO fine-tuning lacks a capability preservation term analogous to the KL term in RLHF · concept · 0.785 · Research gap: RL experiments have capability term but LLM experiments do not yet incorporate one
- Fine-tuning reduces dr; retrieval increases effective ρd; few-shot k trades budget against both · hypothesis · 0.783 · UCCT's unified view of adaptation methods
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.