concept
active
concept:sae-latents
SAE Latents
Interpretable features extracted by sparse autoencoders (SAEs), used as steering targets in this study.
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Relevance Filtering of SAE Latents · related_to · Pre-filtering step that excludes latents naturally activated by each prompt, so that steering is genuinely off-topic
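The relevance-filtering step described above can be illustrated with a minimal sketch. It assumes that each prompt comes with a mapping from latent index to activation strength and that a latent counts as "naturally activated" when its activation exceeds a threshold. The function name `filter_relevant_latents`, the `threshold` parameter, and the example activations are hypothetical and not taken from the study.

```python
from typing import Dict, List, Set


def filter_relevant_latents(
    candidate_latents: List[int],
    prompt_activations: Dict[int, float],
    threshold: float = 0.0,
) -> List[int]:
    """Keep only candidate latents that the prompt does not activate on its own.

    Assumption: a latent is "naturally activated" when its activation on the
    prompt is strictly greater than `threshold`. Latents missing from
    `prompt_activations` are treated as inactive.
    """
    naturally_active: Set[int] = {
        latent for latent, act in prompt_activations.items() if act > threshold
    }
    # Preserve the original candidate order while dropping on-topic latents.
    return [latent for latent in candidate_latents if latent not in naturally_active]


# Hypothetical per-prompt activations (latent index -> activation strength).
activations = {3: 0.0, 7: 2.4, 11: 0.0, 42: 0.9}
steering_targets = filter_relevant_latents([3, 7, 11, 42, 99], activations)
print(steering_targets)  # [3, 11, 99]
```

Raising `threshold` makes the filter more permissive: weakly active latents such as 42 would then remain available as steering targets.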
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Pre-filtering step excluding abstract latents where off-topic detection is harder
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
- Metric measuring the mean MSE between self-referencing and other-referencing activations across all hidden MLP/attention layers
- Hidden or underdeveloped structures existing 'between the lines' of a configuration that can be enhanced and developed through harmony-seeking computation.
- Statistical regularities stored in pretrained models.
- Entities that become visible as centers in a configuration (e.g., rectangles of white space around a dot) that were not present before.
- SAE latents that rise as correction approaches and peak after self-correction begins, complementing OTDs