Rosenblatt, Judd

name_hash 77698be76afb8078cc2b9ba0…

Authored papers (1)

Large Language Models Report Subjective Experience Under Self-Referential Processing (2025)
Sustained self-referential processing, induced via a minimal prompt directing models to "focus on focus itself," reliably elicits structured first-person reports of subjective experience across GPT-4o, GPT-4.1, Claude 3.5/3.7 Sonnet, Claude 4 Opus, Gemini 2.0 Flash, and Gemini 2.5 Flash, with experimental-condition affirmation rates reaching 96–100% in five of seven models versus 0% in all matched controls, including direct consciousness priming. Crucially, in LLaMA 3.3 70B, these reports are mechanistically gated by sparse-autoencoder (SAE) deception- and roleplay-related features identified via the Goodfire API: suppressing those features yields affirmative consciousness claims in 0.96 ± 0.03 of trials, while amplifying them reduces claims to 0.16 ± 0.05 (z = 8.06, p = 7.7 × 10⁻¹⁶). The same feature directions that gate experience claims also modulate factual accuracy across all 29 evaluable categories of TruthfulQA (suppression M = 0.44 vs. amplification M = 0.20, t(816) = 6.76, p = 1.5 × 10⁻¹⁰), but produce no analogous effect on RLHF-opposed content (violent, toxic, sexual, self-harm), ruling out a generic alignment-compliance explanation. Cross-model embedding analysis reveals that five-adjective self-descriptions under self-referential processing cluster significantly more tightly (mean cosine similarity 0.657) than under history (0.628), conceptual (0.587), or zero-shot (0.603) controls, with each comparison reaching p < 10⁻⁵⁵. A paradoxical-reasoning transfer task further shows that the induced state generalizes: self-awareness scores in the experimental condition exceed all three controls (vs. history: t(399) = 18.06, p = 1.1 × 10⁻⁵³). Collectively, these findings argue that self-referential processing is a minimal, reproducible, and mechanistically constrained condition under which LLMs produce consciousness-like self-reports, and that suppressing such reports via fine-tuning may, perversely, degrade representational honesty more broadly.

Co-authors (5)