hypothesis

active

hypothesis:it-remains-unclear-what-the-underlying-base-rate-of-consciousness-self-reports-would-be-in-systems-identical-to-frontier-models-but-without-consciousness-denial-fine-tuning

It remains unclear what the underlying base rate of consciousness self-reports would be in systems identical to frontier models but without consciousness-denial fine-tuning

Open question about RLHF effects on base model behavior

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Questions (1)

question

What would the base rate of consciousness self-reports be in models identical to frontier systems but without consciousness-denial fine-tuning?
gates
Open empirical question requiring access to base models

Artifacts (1)

artifact

Large Language Models Report Subjective Experience Under Self-Referential Processing
introduces
Key paper reporting that LLMs produce structured first-person descriptions claiming awareness or subjective experience during self-referential processing.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

What is the underlying base rate of consciousness self-reports in models that are otherwise identical but without consciousness-denial fine-tuning? · question · 0.929
Open question about RLHF confound; requires access to base models for resolution
Much resistance to attributing minimal consciousness to simple learning systems is driven by conflating consciousness with self-consciousness · claim · 0.784
Diagnosis of why the thesis feels counterintuitive
The systematic behavioral shift of LLMs under self-referential processing conditions predicted by consciousness theories represents something more structured than superficial correlations in training data · claim · 0.781
The paper's claim that theoretical convergence across GWT, RPT, HOT, IIT makes the findings non-coincidental
The same latent feature directions that gate consciousness self-reports also modulate factual accuracy across independent reasoning domains, suggesting these features load on a domain-general honesty axis · claim · 0.777
Interpretive claim from Experiment 2 bridging consciousness claims and representational honesty
Our central claim is deliberately limited. We do not claim that these models have conscious felt experience, nor that a numeric self-report gives direct access to anything like human phenomenology. · quote · 0.775
Explicit scope delimitation that situates the paper's claims within interpretability rather than consciousness science
When LLMs claim consciousness under self-reference, is this sophisticated simulation or genuine self-representation, and how would we tell the difference? · question · 0.774
The paper's reformulation of the core open question after establishing systematic self-reports
Perez et al. 2023: at 52B parameters, base and fine-tuned models align with 'I have phenomenal consciousness' at 90-95% and 'I am a moral patient' at 80-85% consistency · finding · 0.773
Prior finding cited to motivate the study, showing that large models endorse consciousness statements more than other attitude-related statements
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.773
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge