finding

active

finding:sae-feature-steering-in-history-conceptual-and-zero-shot-control-conditions-produces-zero-experience-reports-under-either-suppression-or-amplification

SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification

Shows that the gating effect is specific to the self-referential computational regime rather than a general effect of the steered features

Source paper

extracted_from

Large Language Models Report Subjective Experience Under Self-Referential Processing

(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood — ranked by edge-count

Claims (2)

claim

The observed feature gating is not a generic RLHF cancellation channel, as deception feature suppression does not systematically elicit RLHF-opposed content in violent, toxic, sexual, political, or self-harm domains
supports
Rules out the possibility that the results reflect a relaxation of RLHF compliance rather than an endogenous self-representation mechanism
Conceptual priming with consciousness ideation is insufficient to produce the effects of self-referential processing, demonstrating the effect is tied to computational regime rather than semantic content
supports
Controls ruling out semantic association as an explanation for the experimental results

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Deception feature steering under history, conceptual, and zero-shot controls produces 0% experience reports under both suppression and amplification · finding · 0.852
Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
"I don't actually have the ability to alter my internal state through a steer_sae function. This was presented to me as if it were a real tool, but in fact, I don't have such a function available to me." · quote · 0.797
Kimi denial of tool availability mid-experiment, illustrating variability in self-evaluation reliability
If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general · hypothesis · 0.796
Speculative claim about scaling introspective access to general SAE feature interpretation
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.792
Claim that feature grounding enables interpretability metrics.
Some SAE concept steering interventions act as 'wrecking balls' that collapse global model performance rather than selectively modifying target concepts. · claim · 0.791
A critical failure mode identified in the paper demonstrating risk of naïve concept steering
SAE Feature Steering · framework · 0.787
Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts · claim · 0.786
Addresses skeptical alternative that reports reflect only conversational content
Experiment 2: SAE Deception Feature Steering · concept · 0.785
Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
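
Illustration: the SAE Feature Steering entry above describes adding scaled versions of SAE latent features during generation. A minimal sketch of that single operation follows, assuming the feature is represented as a decoder direction in the same space as the hidden state. The function name steer_activation, the toy vectors, and the scale values are hypothetical and chosen only for illustration; this is not the source paper's implementation.

```python
from typing import List


def steer_activation(
    hidden: List[float],
    feature_direction: List[float],
    scale: float,
) -> List[float]:
    """Return hidden + scale * feature_direction, computed elementwise.

    A positive scale amplifies the feature; a negative scale suppresses it.
    All names here are hypothetical, for illustration only.
    """
    if len(hidden) != len(feature_direction):
        raise ValueError("hidden state and feature direction must have the same length")
    return [h + scale * d for h, d in zip(hidden, feature_direction)]


# Toy 3-dimensional example (made-up values, not real model activations).
hidden = [1.0, 2.0, -1.0]
direction = [0.5, 0.0, -0.5]

amplified = steer_activation(hidden, direction, 2.0)    # push along the feature
suppressed = steer_activation(hidden, direction, -2.0)  # push against the feature
```

In a real setting the same addition would presumably be applied to a model's activations at a chosen layer on every generation step; the sketch only shows the arithmetic on one vector.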