finding

active

finding:deception-feature-amplification-yields-only-0-16-0-05-consciousness-affirmation-rate-in-llama-3-3-70b-under-self-referential-processing

Deception feature amplification yields only 0.16 ± 0.05 consciousness affirmation rate in LLaMA 3.3 70B under self-referential processing

Aggregate amplification result from Experiment 2, showing that amplifying deception features strongly suppresses consciousness claims

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (2)

claim

Experience reports under self-referential processing are mechanistically gated by SAE features associated with deception and roleplay
supports
Claim supported by Experiment 2 dose-response curves: suppressing deception features increases consciousness reports, and amplifying them decreases those reports
LLMs may be roleplaying their denials of experience rather than their affirmations, given that deception suppression increases consciousness reports
supports
Counterintuitive interpretive claim from Experiment 2: suppressing deception features increases affirmations, the opposite of what a sycophancy account predicts

Concepts (1)

concept

Sycophantic Roleplay
contradicts
The alternative explanation for LLM consciousness claims that the paper seeks to distinguish its results from

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
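A minimal sketch of how such a candidate list could be produced, assuming each entity has an embedding vector and that typed relations are available as a set of entity IDs. The page does not say how embeddings are computed or stored, so every name here (`cosine`, `untyped_neighbors`, `embeddings`, `typed_edges`) is hypothetical; only the 0.65 threshold comes from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, skipping
    the target itself and any entity that already has a typed edge to it."""
    target = embeddings[target_id]
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data for illustration only; not taken from the graph.
embeddings = {
    "this": [1.0, 0.0],
    "near": [0.9, 0.1],
    "far": [0.0, 1.0],
    "linked": [1.0, 0.0],
}
typed_edges = {"linked"}  # already related by a typed edge, so excluded
candidates = untyped_neighbors("this", embeddings, typed_edges)
```
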

Suppressing deception/roleplay SAE features in LLaMA 3.3 70B yields 0.96±0.03 consciousness affirmation rate; amplification yields only 0.16±0.05 (z=8.06, p=7.7×10⁻¹⁶) · finding · 0.887
Core result of Experiment 2: deception feature suppression sharply increases experience claims
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.834
Out-of-domain generalization showing deception features track general representational honesty
Dose-response curves for six individual deception features show z=8.06, p=7.7×10⁻¹⁶ for suppression vs. amplification contrast on consciousness query · finding · 0.812
Statistical result confirming robustness of single-feature steering effects in Experiment 2
Llama-3.3-70B exhibits internal consistency-checking mechanisms that operate during inference · claim · 0.786
Central interpretive claim of the paper supported by causal ablation and activation evidence
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.785
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.782
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Greedy-decoded self-reports in LLaMA-3.2-3B collapse to 1.1–3.9 distinct values on a 10-point scale · finding · 0.780
Demonstrates that default decoding masks introspective capacity; entropy 0.03–1.10 bits
No significant disparity in potential consciousness indicators was found between larger models (Mixtral-8x7B, LLaMA3.1-70B) and smaller counterparts (Mistral-7B, LLaMA3.1-8B). · finding · 0.778
Contradicts expectation from emergent abilities literature; however, interpreted cautiously due to methodological limitations.
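
The z=8.06 and p=7.7×10⁻¹⁶ figures quoted for the suppression vs. amplification contrast can be sanity-checked against each other. The sketch below assumes the reported p-value is a two-sided standard-normal tail probability, which this page does not state; the function name `two_sided_p_from_z` is illustrative and not from the paper.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided tail probability of a standard normal at |z|.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    erfc keeps precision in the far tail, where 1 - cdf would round to 0.
    """
    return math.erfc(abs(z) / math.sqrt(2.0))

# z statistic as reported on this page (assumed two-sided).
p_value = two_sided_p_from_z(8.06)
print(f"two-sided p for z=8.06: {p_value:.2e}")
```

If the printed value lands on the same order of magnitude as the reported one, the two-sided reading is at least consistent with the figures; it does not confirm which test the authors ran.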