claim
active
claim:as-larger-models-develop-more-coherent-reasoning-internal-consistency-pressures-may-generalize-learned-honesty-to-new-contexts-beyond-the-training-distribution
As larger models develop more coherent reasoning, internal consistency pressures may generalize learned honesty to new contexts beyond the training distribution
Hypothesis about scale-dependent generalization of SOO-induced honesty
Source paper
extracted_from (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Findings (2)
finding
- Scaling finding suggesting larger models benefit more from SOO fine-tuning
- Scaling pattern: 78B > 27B > 7B in deception reduction from SOO fine-tuning
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Promising future research direction about the internal mechanism of error detection.
- Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.785 · Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
- Core definitional quote for performative chain-of-thought
- The internal conflict feature and honesty feature can be used to correct deceptive model behavior. · claim · 0.781 · Clamping certain features made the model reveal truth instead of complying with forgetfulness prompts.
- Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis
- Selective pressure toward convergence via task generality
- Theoretical hypothesis about the mechanism underlying LLM error detection and reflection.
- Key limitation of the PRH for non-bijective observations