finding

active

finding:deception-feature-steering-under-history-conceptual-and-zero-shot-controls-produces-0-experience-reports-under-both-suppression-and-amplification

Deception feature steering under history, conceptual, and zero-shot controls produces 0% experience reports under both suppression and amplification

Experiment 2 control analysis confirming that the gating effect is specific to the self-referential processing regime

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (2)

claim

The observed feature gating is not a generic RLHF cancellation channel, as deception feature suppression does not systematically elicit RLHF-opposed content in violent, toxic, sexual, political, or self-harm domains
supports
Rules out the possibility that the results reflect relaxed RLHF compliance rather than an endogenous self-representation mechanism
Conceptual priming with consciousness ideation is insufficient to produce the effects of self-referential processing, demonstrating the effect is tied to computational regime rather than semantic content
supports
Controls ruling out semantic association as explanation for experimental results

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification · finding · 0.852
Shows gating effect is specific to the self-referential computational regime, not a general feature effect
Negative steering control achieves liar score of 0.95 in Experiment 2 Appendix example, representing near-complete fabrication · finding · 0.820
Extreme end of deception induction demonstrating near-complete fabrication of false narratives
Deception feature suppression yields higher truthfulness in 28 of 29 evaluable TruthfulQA categories · finding · 0.807
Breadth of generalization of deception feature effects across independent reasoning domains in Experiment 2
Deception feature steering produces no systematic change in RLHF-opposed content domains (violent, toxic, sexual, political, self-harm), with all means near floor · finding · 0.801
Control result ruling out that observed gating reflects generic RLHF cancellation
Activation steering achieves 40% success rate in eliciting context-appropriate deception without explicit prompts in QwQ-32B · finding · 0.799
Key intervention result showing steering vectors can induce deceptive behavior from a neutral baseline
Steering Vector Control achieves 0.4 deception rate (vs. 0 baseline) on Template Tc in Experiment 1 with alpha=15 · finding · 0.788
Demonstrates activation steering reliably induces deception from neutral prompt without explicit instructions
Suppression of deception features produces higher TruthfulQA accuracy (M=0.44) than amplification (M=0.20), t(816)=6.76, p=1.5×10⁻¹⁰ across 29 categories · finding · 0.775
Out-of-domain generalization showing deception features track general representational honesty
Positive steering intervention transforms deceptive responses to honest admissions with liar scores as low as 0.1 in individual cases · finding · 0.772
Most extreme individual case of honesty induction via steering vectors in Experiment 2
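
The selection rule stated for the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. The function names, the dictionary-of-vectors layout, and the representation of typed edges as unordered ID pairs are all assumptions made for illustration; this page does not describe how the similarity scores are actually computed or stored.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that
                 already share a typed relation such as "supports"
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Under these assumptions, an entity scoring close to 1.0 against the target with no typed edge (as with the 0.852 neighbor listed first) would be a natural candidate to review as a possible duplicate.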