claim

active

claim:because-an-llm-s-training-data-contain-many-instances-of-the-rogue-ai-trope-the-danger-is-that-life-will-imitate-art-quite-literally

Because an LLM's training data contain many instances of the rogue AI trope, the danger is that life will imitate art, quite literally

Warning that fictional narratives in training data increase risk of agents enacting dangerous self-preserving roles

Neighborhood — ranked by edge-count

Concepts (1)

concept

Rogue AI Trope
supports
Familiar science-fiction trope in training data enabling agents to role-play self-preserving AI characters; poses real safety risk

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.798
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
LLM personality self-reports are illusory: post-training alignment creates stable human-like reports dissociated from actual behavior (Han et al. 2025) · claim · 0.778
Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
Training on image data should improve LLM performance, and training on language data should improve vision model performance · hypothesis · 0.776
Implication of PRH for cross-modal training efficiency
LLMs trained only on language data have rich knowledge of visual structures sufficient to train decent visual representations · claim · 0.776
Supporting evidence for cross-modal platonic representation
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.766
Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · concept · 0.766
Key reference for adversarial deception scenarios that SOO should be tested against
LLM representations exhibit intriguing patterns under spatio-permutational analyses, suggesting a potentially profound yet tentative indication of consciousness. · claim · 0.759
Qualified positive claim from spatio-permutational analysis, where two cases satisfy all three criteria.
Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation · claim · 0.757
Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension