finding
active
finding:sae-feature-28256-induces-reports-of-happiness-and-fun-positive-valence-self-steering-example
SAE Feature #28256 induces reports of happiness and fun, positive valence self-steering example
Example of a positively valenced SAE feature with consistent self-report of happiness across multiple steering sessions
Source paper
extracted_from: Scott Sauers · Imago · Janus · Antra Tessera
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Qualitative illustration of a positively valenced SAE feature with sustained self-reported effect
- Shows that highest emotion-subspace-overlap features induce distinctive thematic outputs
- Highest-rated emotional SAE feature; self-report describes overwhelming positive emotional valence
- Text-based and self-steered emotionality ratings for SAE features are correlated at only ρ = +0.051 (n.s.) · finding · 0.811 · Shows low agreement between the two evaluation modalities
- Qualitative example of a specific, complex emotional state induced by SAE feature steering
- SAE Feature #77278 fires 195,040 times in corpus, associated with satisfaction vs. emptiness dimension · finding · 0.794 · High-frequency SAE feature reported as controlling fundamental positive vs. negative affect dimension
- Demonstrates partial but reliable validity of self-evaluation for measuring probe emotionality
- Shows gating effect is specific to the self-referential computational regime, not a general feature effect