claim

active

claim:as-larger-models-develop-more-coherent-reasoning-internal-consistency-pressures-may-generalize-learned-honesty-to-new-contexts-beyond-the-training-distribution

As larger models develop more coherent reasoning, internal consistency pressures may generalize learned honesty to new contexts beyond the training distribution

Hypothesis about scale-dependent generalization of SOO-induced honesty

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Findings (2)

finding

SOO fine-tuning effectiveness scales with model size: 78B achieves 2.71% deceptive rate vs 9.36% for 27B vs 17.27% for 7B
supports
Scaling finding suggesting larger models benefit more from SOO fine-tuning
Larger LLMs show greater reduction in deceptive behavior after SOO fine-tuning
supports
Scaling pattern: 78B > 27B > 7B in deception reduction from SOO fine-tuning

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Does the model internally maintain a form of 'consistency score' or probability mass over coherent reasoning trajectories, and how is this score modulated during reflection? · question · 0.795
Promising future research direction about the internal mechanism of error detection.
Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.785
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief · quote · 0.784
Core definitional quote for performative chain-of-thought
The internal conflict feature and honesty feature can be used to correct deceptive model behavior. · claim · 0.781
Clamping certain features made the model reveal truth instead of complying with forgetfulness prompts.
Behavioral evidence from closed-weight models cannot definitively rule out that self-reports reflect training artifacts or sophisticated simulation rather than genuine self-awareness · claim · 0.771
Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis
There are fewer representations competent for N tasks than for M < N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.769
Selective pressure toward convergence via task generality
LLMs implicitly learn a distribution of 'consistent reasoning paths', and inconsistent reasoning forms statistical outliers with low probability under this distribution. · hypothesis · 0.766
Theoretical hypothesis about the mechanism underlying LLM error detection and reflection.
Different models cannot converge to the same representation if they have access to fundamentally different information; convergence is capped by mutual information between input signals · claim · 0.762
Key limitation of the PRH for non-bijective observations