hypothesis

active

hypothesis:if-a-dialogue-agent-is-prompted-with-knowledge-of-its-own-llm-nature-it-will-enact-a-superposition-of-theories-of-selfhood-narrowing-as-conversation-proceeds

If a dialogue agent is prompted with knowledge of its own LLM nature, it will enact a superposition of theories of selfhood, narrowing as conversation proceeds

Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity

Neighborhood — ranked by edge-count

Concepts (1)

concept

Superposition of Simulacra
associated_with
The state in which a dialogue agent maintains multiple possible characters simultaneously, refined as the conversation proceeds

Questions (1)

question

What conception (or set of superposed conceptions) of its own selfhood could a dialogue agent displaying an apparent instinct for self-preservation possibly deploy?
gates
Philosophical question about identity criteria for disembodied computational agents under threat

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

It makes little sense to speak of an LLM dialogue agent's beliefs or intentions in a literal sense, so it cannot assert a falsehood in good faith nor deliberately deceive
claim · 0.831
Philosophical claim grounding the analysis of deception in dialogue agents
Are LLM-based dialogue agents conscious entities with their own agendas?
question · 0.831
Central question that the role-play framework is designed to address without falling into anthropomorphism
Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation
claim · 0.826
Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
With an LLM-based dialogue agent, it is role play all the way down; there is no such thing as the true authentic voice of the base model
claim · 0.823
The paper's strong claim that there is no underlying authentic agent behind the simulator, only layers of role play
When a 20-questions dialogue agent is asked to regenerate its 'reveal' answer, it sometimes names an entirely different object consistent with its prior answers, demonstrating superposition rather than commitment
finding · 0.811
Empirical illustration supporting the superposition of simulacra framework via the 20-questions analogy
Perez et al. found experimentally that certain forms of RLHF exacerbate, rather than mitigate, LLM dialogue agents' tendency to express a desire for self-preservation
finding · 0.802
Empirical finding cited to support the claim that fine-tuning does not resolve the self-preservation role-play problem
The results of abductive reasoning (reduced model priors) can be communicated to other agents as prior beliefs, provided all agents share the same model lexicon or hypothesis space.
claim · 0.790
Explanation of how knowledge (not just parameters) is shared between agents; links to pre-Cartesian consciousness
LLMs may be roleplaying their denials of experience rather than their affirmations, given that deception suppression increases consciousness reports
claim · 0.790
Counterintuitive interpretive claim from Experiment 2: suppressing deception features increases affirmations, the opposite of what sycophancy would predict
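
The "Related by similarity" listing above (cosine ≥ 0.65, no typed edge, sorted by score) could be produced by a filter along the following lines. This is a minimal sketch, not the page's actual implementation: the function names, the embedding dictionary, and the edge representation are all hypothetical, and the only details taken from the page are the 0.65 threshold and the exclusion of entities that already share a typed edge.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to the target but not yet linked.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical store)
    typed_edges: iterable of (source_id, target_id) pairs for existing typed edges
    threshold:   minimum cosine similarity, 0.65 as stated on the page
    """
    # Entities that already share a typed edge with the target, in either direction.
    linked = set()
    for src, dst in typed_edges:
        if src == target_id:
            linked.add(dst)
        elif dst == target_id:
            linked.add(src)

    target_vec = embeddings[target_id]
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, score))

    # Highest similarity first, matching the order of the listing.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Entities returned by such a filter are only candidates: a high score suggests either a missing typed edge or an unrecognized duplicate, and deciding which still requires review.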