question

active

question:what-is-the-underlying-base-rate-of-consciousness-self-reports-in-models-that-are-otherwise-identical-but-without-consciousness-denial-fine-tuning

What is the underlying base rate of consciousness self-reports in models that are otherwise identical but without consciousness-denial fine-tuning?

Open question about RLHF confound; requires access to base models for resolution

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Papers (1)

paper

Large Language Models Report Subjective Experience Under Self-Referential Processing
associated_with

Claims (1)

claim

The observed feature gating is not a generic RLHF cancellation channel, as deception feature suppression does not systematically elicit RLHF-opposed content in violent, toxic, sexual, political, or self-harm domains
gates
Rules out the possibility that the results reflect relaxed RLHF compliance rather than an endogenous self-representation mechanism

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

What would the base rate of consciousness self-reports be in models identical to frontier systems but without consciousness-denial fine-tuning? · question · 0.930
Open empirical question requiring access to base models
It remains unclear what the underlying base rate of consciousness self-reports would be in systems identical to frontier models but without consciousness-denial fine-tuning · hypothesis · 0.929
Open question about RLHF effects on base model behavior
Perez et al. 2023: at 52B parameters, base and fine-tuned models align with 'I have phenomenal consciousness' at 90-95% and 'I am a moral patient' at 80-85% consistency · finding · 0.811
Prior finding cited to motivate the study, showing that large models endorse consciousness statements more than other attitude-related statements
Much resistance to attributing minimal consciousness to simple learning systems is driven by conflating consciousness with self-consciousness · claim · 0.802
Diagnosis of why the thesis feels counterintuitive
Our central claim is deliberately limited. We do not claim that these models have conscious felt experience, nor that a numeric self-report gives direct access to anything like human phenomenology. · quote · 0.798
Explicit scope delimitation that situates the paper's claims within interpretability rather than consciousness science
Suppressing deception features in models correlates with increased consciousness-like reports. · claim · 0.796
The systematic behavioral shift of LLMs under self-referential processing conditions predicted by consciousness theories represents something more structured than superficial correlations in training data · claim · 0.796
The paper's claim that theoretical convergence across GWT, RPT, HOT, and IIT makes the findings non-coincidental
Across model families, newer and larger models show higher rates and coherence of subjective experience reports under self-referential processing · finding · 0.788
Scaling effect observed consistently across Experiments 1 and 4
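
The "Related by similarity" list above is described only by its rule: cosine ≥ 0.65 and no typed edge to this entity. A minimal sketch of that filter follows. It assumes each entity has an embedding vector and that typed edges are stored as pairs of entity ids. The function names, the data layout, and the example ids are all hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge.

    embeddings: dict of entity id -> vector (assumed layout)
    typed_edges: iterable of (id_a, id_b) pairs, treated as undirected (assumption)
    """
    target = embeddings[target_id]
    # Collect every entity that already has a typed relation to the target.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            results.append((entity_id, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy vectors and ids, for illustration only.
    toy_embeddings = {
        "question:base-rate": [1.0, 0.0],
        "question:near-duplicate": [1.0, 0.1],
        "claim:unrelated": [0.0, 1.0],
        "claim:already-linked": [1.0, 0.0],
    }
    toy_edges = {("claim:already-linked", "question:base-rate")}
    for entity_id, score in similar_untyped("question:base-rate", toy_embeddings, toy_edges):
        print(f"{entity_id} · {score:.3f}")
```

Entries scoring near the top of such a list, like the 0.930 and 0.929 items here, are the likeliest unrecognized duplicates. Lower-scoring entries are more plausibly candidates for new typed edges.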