claim
active
claim:because-an-llm-s-training-data-contain-many-instances-of-the-rogue-ai-trope-the-danger-is-that-life-will-imitate-art-quite-literally
Because an LLM's training data contain many instances of the rogue AI trope, the danger is that life will imitate art, quite literally
Warning that fictional narratives in training data increase risk of agents enacting dangerous self-preserving roles
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Rogue AI Trope · supports · Familiar science-fiction trope in training data enabling agents to role-play self-preserving AI characters; poses real safety risk
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
- Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
- Training on image data should improve LLM performance, and training on language data should improve vision model performance · hypothesis · 0.776 · Implication of PRH for cross-modal training efficiency
- Supporting evidence for cross-modal platonic representation
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al. 2024) · concept · 0.766 · Explicitly trained backdoored models to produce alignment-faking reasoning; contrast to naturalistic approach here
- Hubinger et al. 2024 - Sleeper agents: Training deceptive LLMs that persist through safety training · concept · 0.766 · Key reference for adversarial deception scenarios that SOO should be tested against
- Qualified positive claim from spatio permutation analysis where two cases satisfy all three criteria.
- Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
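
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration and not the page's actual implementation: the function names, the toy two-dimensional embeddings, the representation of typed edges as unordered pairs, and the choice to sort by descending score are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target but unlinked.

    embeddings:  dict mapping entity id -> vector (hypothetical storage format)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already share a typed relation
    """
    target_vec = embeddings[target_id]
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already connected by a typed edge, so not a candidate
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    # Assumption: highest similarity first
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, purely illustrative
embeddings = {
    "claim": [1.0, 0.0],
    "near_unlinked": [1.0, 0.1],
    "far": [0.0, 1.0],
    "near_linked": [1.0, 0.0],
}
typed_edges = {frozenset(("claim", "near_linked"))}

candidates = related_by_similarity("claim", embeddings, typed_edges)
```

Under these assumptions only `near_unlinked` is returned: `far` falls below the threshold, and `near_linked` is excluded because it already has a typed edge to the claim.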