finding
active
finding:larger-llms-show-greater-reduction-in-deceptive-behavior-after-soo-fine-tuning

Larger LLMs show greater reduction in deceptive behavior after SOO fine-tuning

Scaling pattern: the reduction in deceptive behavior from Self-Other Overlap (SOO) fine-tuning grows with model size (78B > 27B > 7B)

Source paper

extracted_from
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.