SAE Latents

Interpretable features extracted by sparse autoencoders, used as steering targets in this study.

Neighborhood — ranked by edge-count

concept

Relevance Filtering of SAE Latents
related_to
Pre-filtering step excluding latents naturally activated by each prompt to ensure genuine off-topic steering
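
As a rough illustration of the pre-filtering idea described above, the sketch below drops any latent that already fires on a prompt, so that only latents silent on that prompt remain as off-topic steering candidates. The data layout (a dict of per-token activations), the function name `filter_relevant_latents`, and the `threshold` parameter are assumptions made for this example; the source does not specify how the filter is implemented.

```python
def filter_relevant_latents(prompt_activations, threshold=0.0):
    """Return IDs of latents whose activation never exceeds `threshold` on the prompt.

    prompt_activations: dict mapping latent ID -> list of per-token activations.
    This layout is hypothetical; the source does not describe one.
    """
    kept = []
    for latent_id, acts in prompt_activations.items():
        # A latent that fires anywhere on the prompt counts as "naturally
        # activated" and is excluded from the steering candidates.
        if max(acts, default=0.0) <= threshold:
            kept.append(latent_id)
    return sorted(kept)


# Toy activations for three made-up latent IDs on a three-token prompt.
example = {
    101: [0.0, 0.0, 0.0],  # silent on the prompt, kept
    202: [0.0, 3.1, 0.4],  # fires on the prompt, excluded
    303: [0.0, 0.0, 0.0],  # silent on the prompt, kept
}
candidates = filter_relevant_latents(example)
```

A threshold of zero is the strictest choice; a small positive value would tolerate weak background activation, at the cost of admitting latents that are partly on-topic.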

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
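
One way such a candidate list might be produced is sketched below: compute cosine similarity between the target entity's embedding and every other entity's embedding, keep those at or above 0.65, and skip anything already linked by a typed edge. The embedding vectors, the entity names, and the `untyped_neighbors` helper are all hypothetical; only the 0.65 threshold and the "no typed edge" condition come from the page itself.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, min_cosine=0.65):
    """List (name, score) pairs similar to `target` but lacking a typed edge.

    embeddings: dict of entity name -> vector (hypothetical storage format).
    typed_edges: set of entity names already linked to `target` by a typed relation.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= min_cosine:
            results.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list on this page.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy two-dimensional embeddings, not real data.
toy_embeddings = {
    "target": [1.0, 0.0],
    "A": [1.0, 0.5],  # similar and untyped, returned
    "B": [0.0, 1.0],  # orthogonal, below the threshold
    "C": [1.0, 1.0],  # similar but already has a typed edge, skipped
}
neighbors = untyped_neighbors("target", toy_embeddings, typed_edges={"C"})
```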

Concreteness Filtering of SAE Latents · concept · 0.790
Pre-filtering step excluding abstract latents where off-topic detection is harder.
Sparse Autoencoders (SAE) · method · 0.782
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
SAE features · concept · 0.776
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
Latent SOO Metric · method · 0.756
Metric measuring the mean MSE between self-referencing and other-referencing activations across all hidden MLP/attention layers.
Latent Structures · concept · 0.745
Hidden or underdeveloped structures existing 'between the lines' of a configuration that can be enhanced and developed through harmony-seeking computation.
latent patterns · concept · 0.742
Statistical regularities stored in pretrained models.
Latent entities · concept · 0.741
Entities that become visible as centers in a configuration (e.g., rectangles of white space around a dot) that were not present before.
Backtracking Latents · concept · 0.734
SAE latents that rise as correction approaches and peak after self-correction begins, complementing OTDs.