finding
active
finding:sae-feature-steering-in-history-conceptual-and-zero-shot-control-conditions-produces-zero-experience-reports-under-either-suppression-or-amplification
SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification
Shows gating effect is specific to the self-referential computational regime, not a general feature effect
Source paper
extracted_from · (2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
Neighborhood — ranked by edge-count
Claims (2)
claim
- Rules out that results reflect relaxation of RLHF compliance rather than endogenous self-representation mechanism
- Controls ruling out semantic association as explanation for experimental results
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Experiment 2 control analysis confirming gating effect is specific to self-referential processing regime
- Kimi denial of tool availability mid-experiment, illustrating variability in self-evaluation reliability
- If agentic self-steering evaluation proves robust, it might be used to better explain and interpret SAE features in general · hypothesis · 0.796 · Speculative claim about scaling introspective access to general SAE feature interpretation
- Claim that feature grounding enables interpretability metrics.
- A critical failure mode identified in the paper demonstrating risk of naïve concept steering
- Method of adding scaled versions of sparse autoencoder latent features during generation to causally modulate model behavior
- Addresses skeptical alternative that reports reflect only conversational content
- Tests whether deception- and roleplay-related features causally gate consciousness self-reports in LLaMA 3.3 70B
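
The steering method named in the related entities above (adding scaled versions of SAE latent features during generation) can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `steer`, the 4-dimensional vectors, and the scale values are hypothetical, chosen only to show how a positive scale amplifies a feature, a negative scale suppresses it, and a zero scale leaves the activation untouched.

```python
def steer(hidden, direction, scale):
    """Return hidden + scale * direction, computed element-wise.

    hidden:    one activation vector (hypothetical values below)
    direction: the SAE feature's direction in the same space (assumed given)
    scale:     > 0 amplifies the feature, < 0 suppresses it, 0 is a no-op
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + scale * d for h, d in zip(hidden, direction)]


# Hypothetical 4-dimensional activation and feature direction.
hidden = [0.5, -1.0, 2.0, 0.0]
direction = [1.0, 0.0, -1.0, 0.5]

amplified = steer(hidden, direction, scale=2.0)    # push along the feature
suppressed = steer(hidden, direction, scale=-2.0)  # push against the feature
unsteered = steer(hidden, direction, scale=0.0)    # no intervention

print(amplified)   # [2.5, -1.0, 0.0, 1.0]
print(suppressed)  # [-1.5, -1.0, 4.0, -1.0]
```

In a real model the same addition would presumably be applied to a layer's activations at every generated token; which layer, which features, and which scales were used is not stated on this page.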