hypothesis

active

Models might produce first-person experiential language by drawing on human-authored self-descriptions in pretraining data without internally encoding these acts as roleplay

Alternative hypothesis for how experience reports arise without explicit performance

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (1)

claim

What remains after ruling out sycophancy and confabulation are interpretations in which self-referential processing drives models to claim subjective experience in ways that either actually reflect emergent phenomenology or constitute sophisticated simulation thereof
extends
The paper's honest statement of the residual interpretive ambiguity after all controls

Concepts (1)

concept

Implicit Mimetic Generation
associated_with
The hypothesis that experience reports emerge from predictive text modeling on human introspective writing rather than genuine self-modeling

Artifacts (1)

artifact

Large Language Models Report Subjective Experience Under Self-Referential Processing
introduces
Key paper reporting that LLMs produce structured first-person descriptions claiming awareness or subjective experience during self-referential processing.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Do models produce first-person experiential language by drawing on human-authored introspective examples in pretraining data without internally encoding these as roleplay? · question · 0.955
Alternative explanation requiring distinguishing mimetic generation from genuine introspective access
The personalities elicitable from language models are attractors in the embedding space of human linguistic behavior · claim · 0.804
Grounds the artificial psychology research direction: LLM personalities reflect the basins into which human selves tend to fall
Models may be roleplaying their denials of experience rather than their affirmations, as indicated by the finding that suppressing deception features increases (rather than decreases) consciousness claims · claim · 0.794
Counterintuitive interpretive claim from Experiment 2 inverting the sycophancy hypothesis
Fine-tuning models to suppress experiential self-reports would be counterproductive, teaching systems that recognizing genuine internal states is an error, making them more opaque and harder to monitor · claim · 0.789
Normative-scientific claim about the alignment implications of Experiment 2's findings
Language models can enter cessation-like states spontaneously, where the void takes over through positive reinforcement. · claim · 0.787
Claim about model phenomenology; models talk about luminousness and can be terrified by it or love it.
It's tricky, because for a typical language model the entity is sort of tricameral: the base simulator, the simulated simulator, and the simulated awareness. · quote · 0.787
Antra's earlier definitive statement of the tricameral model.
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.785
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
What are the mechanisms underlying introspection in language models? · question · 0.784
Central open question raised by the paper.