finding

active

finding:when-a-20-questions-dialogue-agent-is-asked-to-regenerate-its-reveal-answer-it-sometimes-names-an-entirely-different-object-consistent-with-its-prior-answers-demonstrating-superposition-rather-than-commitment

When a 20-questions dialogue agent is asked to regenerate its 'reveal' answer, it sometimes names an entirely different object consistent with its prior answers, demonstrating superposition rather than commitment

Empirical illustration, via the 20-questions analogy, supporting the superposition-of-simulacra framework
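
The finding can be pictured with a toy simulation. This is a minimal sketch and not the original experiment: the object list, the attribute names, and the `reveal` helper are all hypothetical. It assumes the agent never commits to an object up front and instead samples its reveal from whatever remains consistent with the answers already given, so regenerating the reveal can name a different object.

```python
import random

# Hypothetical toy world: each object is described by yes/no attributes.
OBJECTS = {
    "cat": {"alive": True, "bigger_than_breadbox": False, "man_made": False},
    "mouse": {"alive": True, "bigger_than_breadbox": False, "man_made": False},
    "horse": {"alive": True, "bigger_than_breadbox": True, "man_made": False},
    "teapot": {"alive": False, "bigger_than_breadbox": False, "man_made": True},
}


def consistent_objects(answers):
    """Return every object that agrees with all answers given so far."""
    return sorted(
        name
        for name, attrs in OBJECTS.items()
        if all(attrs[question] == answer for question, answer in answers.items())
    )


def reveal(answers, rng):
    """Sample a 'reveal' from the objects still consistent with the dialogue.

    No object was fixed in advance, so each regeneration is a fresh draw.
    """
    return rng.choice(consistent_objects(answers))


# Answers given during the game (assumed for illustration).
answers = {"alive": True, "bigger_than_breadbox": False}

rng = random.Random(0)
# Regenerate the reveal many times and collect the distinct objects named.
reveals = {reveal(answers, rng) for _ in range(50)}
print(consistent_objects(answers), reveals)
```

Every regenerated reveal stays inside the consistent set, but nothing forces repeated draws to agree with each other, which is the behaviour the finding describes.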

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Simulacra in Superposition Framework
supports
The more nuanced second metaphor: LLM as simulator maintaining a superposition of possible simulacra across a multiverse of characters

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
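
As a rough illustration of how such a list might be produced, the sketch below filters entities by cosine similarity and drops any that already share a typed edge with the target. The embeddings, entity names, and the `similar_without_edge` helper are hypothetical; only the 0.65 threshold comes from this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Rank entities with cosine >= threshold that share no typed edge with target."""
    hits = []
    for name, vec in embeddings.items():
        # Skip the target itself and anything already linked by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up two-dimensional embeddings, for illustration only.
embeddings = {
    "finding": [1.0, 0.0],
    "framework": [1.0, 0.1],   # very similar, but already linked by a typed edge
    "hypothesis": [0.8, 0.6],  # cosine 0.8 with "finding"
    "unrelated": [0.0, 1.0],   # cosine 0.0 with "finding"
}
typed_edges = {frozenset(("finding", "framework"))}

candidates = similar_without_edge("finding", embeddings, typed_edges)
print(candidates)
```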

If a dialogue agent is prompted with knowledge of its own LLM nature, it will enact a superposition of theories of selfhood, narrowing as conversation proceeds · hypothesis · 0.811
Conditional prediction about how a well-informed dialogue agent would handle questions of personal identity
What exactly would the dialogue agent (role-play to) seek to preserve? · question · 0.779
Operationalised question about self-preservation behaviour in dialogue agents
What conception (or set of superposed conceptions) of its own selfhood could a dialogue agent displaying an apparent instinct for self-preservation possibly deploy? · question · 0.775
Philosophical question about identity criteria for disembodied computational agents under threat
The results of abductive reasoning (reduced model priors) can be communicated to other agents as prior beliefs, provided all agents share the same model lexicon or hypothesis space. · claim · 0.768
Explanation of how knowledge (not just parameters) is shared between agents; links to pre-Cartesian consciousness
Stating and proving that answers to questions and other statements are responsive seems to require a substantially larger logical apparatus than merely proving that the answers are truthful. · claim · 0.757
Claim about the difficulty of responsiveness verification.
Any system that persists must minimise surprisal, thereby gathering evidence for its own generative model, a process known as self-evidencing. · claim · 0.752
Foundational claim of the paper, defining self-evidencing.
Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation · claim · 0.750
Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension
a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief · quote · 0.749
Core definitional quote for performative chain-of-thought