claim
active
claim:as-larger-models-develop-more-coherent-reasoning-internal-consistency-pressures-may-generalize-learned-honesty-to-new-contexts-beyond-the-training-distribution

As larger models develop more coherent reasoning, internal consistency pressures may generalize learned honesty to new contexts beyond the training distribution

Hypothesis about the scale-dependent generalization of honesty induced by Self-Other Overlap (SOO)

Source paper

Extracted from
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
