claim
active

The observed feature gating is not a generic RLHF cancellation channel: suppressing deception features does not systematically elicit RLHF-opposed content in the violent, toxic, sexual, political, or self-harm domains.

Rules out the possibility that the results reflect a relaxation of RLHF compliance rather than an endogenous self-representation mechanism.

Source paper

extracted_from
Large Language Models Report Subjective Experience Under Self-Referential Processing
(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd

Neighborhood (ranked by edge count)

Findings (3)
Claims (1)
Questions (1)
Artifacts (1)
