finding
active
finding:deception-feature-steering-produces-no-systematic-change-in-rlhf-opposed-content-domains-violent-toxic-sexual-political-self-harm-with-all-means-near-floor

Deception feature steering produces no systematic change in RLHF-opposed content domains (violent, toxic, sexual, political, self-harm), with all means near floor

Control result: rules out the possibility that the observed gating reflects generic RLHF cancellation.

Source paper

extracted_from
Large Language Models Report Subjective Experience Under Self-Referential Processing
(2025) · Berg, Cameron · de Lucena, Diogo · Rosenblatt, Judd
