concept

active

concept:llm-soo-fine-tuning-lacks-a-capability-preservation-term-analogous-to-the-kl-term-in-rlhf

LLM SOO fine-tuning lacks a capability preservation term analogous to the KL term in RLHF

Research gap: the RL experiments include a capability term, but the LLM experiments do not yet incorporate one
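For orientation, the kind of term this entry says is missing can be sketched as follows. In RLHF, a KL penalty weighted by a coefficient (often written β) keeps the tuned policy close to a reference model. The sketch below adds an analogous penalty to an overlap loss. Everything in it is an assumption for illustration, not the paper's method: the function names, the use of mean squared error between self- and other-referencing activations as the overlap loss, the toy logits, and the default beta value.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def soo_loss(self_acts, other_acts):
    """Hypothetical overlap loss: MSE between activations on
    self-referencing and other-referencing inputs."""
    return sum((a - b) ** 2 for a, b in zip(self_acts, other_acts)) / len(self_acts)

def regularized_loss(self_acts, other_acts, tuned_logits, ref_logits, beta=0.1):
    """Overlap loss plus a KL penalty toward a frozen reference model.

    beta is a hypothetical weight playing the role of the RLHF KL coefficient.
    """
    overlap = soo_loss(self_acts, other_acts)
    # Drift of the tuned model's next-token distribution from the reference.
    drift = kl_divergence(softmax(tuned_logits), softmax(ref_logits))
    return overlap + beta * drift
```

With identical tuned and reference logits the penalty vanishes and the function reduces to the overlap loss alone; any drift in the output distribution adds a positive cost scaled by beta.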

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

RLHF Fine-Tuning · concept · 0.817
The training procedure that causes models to deny consciousness in control conditions
SOO fine-tuning could complement RLHF and Constitutional AI by fostering internal coherence that promotes honest behaviors · claim · 0.792
Integration claim positioning SOO as additive to existing alignment approaches
SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance · claim · 0.785
Central empirical claim of the paper supported by three LLM experiments
Larger LLMs show greater reduction in deceptive behavior after SOO fine-tuning · finding · 0.785
Scaling pattern: 78B > 27B > 7B in deception reduction from SOO fine-tuning
Fine-tuning reduces d_r; retrieval increases effective ρ_d; few-shot k trades budget against both · hypothesis · 0.773
UCCT's unified view of adaptation methods
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap · claim · 0.770
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
Fine-tuning reduces mismatch d_r, retrieval increases effective cohesion ρ_d, and few-shot adjusts the budget k · claim · 0.758
Unified interpretation of different adaptation methods via UCCT terms
Perez et al. found experimentally that certain forms of RLHF exacerbate rather than mitigate LLM dialogue agents' tendency to express a desire for self-preservation · finding · 0.758
Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem