question

active

question:to-what-extent-does-self-deception-in-ai-models-affect-the-effectiveness-of-soo-fine-tuning

To what extent does self-deception in AI models affect the effectiveness of SOO fine-tuning?

Open concern about whether models can learn to self-deceive in ways that undermine SOO

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Towards Safe and Honest AI Agents with Neural Self-Other Overlap
associated_with

Claims (1)

claim

SOO fine-tuning significantly reduces deceptive behavior in LLMs while maintaining general task performance
gates
Central empirical claim of the paper supported by three LLM experiments

Concepts (1)

concept

Self-Deception Risk in SOO
associated_with
Concern that models engaging in self-deception could reduce effectiveness of SOO fine-tuning

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

What are the long-term effects of SOO fine-tuning on model behavior? · question · 0.803
Open research question identified as warranting further investigation
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.803
Future work hypothesis about extending SOO to direct value alignment
SOO fine-tuning's focus on contrastive self and other-referencing observations offers strong potential for generalization across AI architectures · claim · 0.800
Forward-looking claim about architectural generalizability of SOO
Fine-tuning induces the behavioral pattern of self-correction but does not improve the underlying ability to correct effectively · claim · 0.783
Key interpretive conclusion from the dissociation between attempt rate and improvement rate in fine-tuning experiments
SOO fine-tuning preserves useful self-other distinctions necessary for task performance despite inducing overlap · claim · 0.775
Claim supported by Perspectives scenario results showing near-100% accuracy post-fine-tuning
An artificial model replicating mechanisms of self-illusion can test hypotheses and reveal novel affordances for non-human intelligence · hypothesis · 0.775
Methodological proposal to integrate knowledge from contemplative and cognitive science into AI/artificial life frameworks.
Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor · claim · 0.773
Normative-scientific claim about the alignment implications of Experiment 2's findings
SOO fine-tuning may provide robustness against sleeper agent deception scenarios where intent is concealed over extended periods · hypothesis · 0.766
Future work hypothesis about testing SOO against adversarial sleeper agent scenarios
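
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the page does not say how embeddings are computed or how edges are stored, so the function name `related_by_similarity`, the `embeddings` dictionary, and the `typed_edges` pair list are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge
    to the target, sorted by descending score.

    embeddings  -- hypothetical dict mapping entity id to an embedding vector
    typed_edges -- hypothetical iterable of (entity_id, entity_id) pairs,
                   treated as undirected
    """
    target = embeddings[target_id]
    # Entities that already share a typed edge with the target are excluded.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Under these assumptions, each entry returned is a candidate for a new typed edge or a possible duplicate, which matches how the list above is described.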