concept
active
concept:sae-latents

SAE Latents

Interpretable features extracted by sparse autoencoders (SAEs), used as steering targets in this study.

Neighborhood — ranked by edge-count

Concepts (1)

concept

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Pre-filtering step excluding abstract latents where off-topic detection is harder
  • Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
  • SAE features · concept · 0.776
    The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
  • Metric measuring the mean MSE between self and other-referencing activations across all hidden MLP/attention layers
  • Latent Structures · concept · 0.745
    Hidden or underdeveloped structures existing 'between the lines' of a configuration that can be enhanced and developed through harmony-seeking computation.
  • latent patterns · concept · 0.742
    Statistical regularities stored in pretrained models.
  • Latent entities · concept · 0.741
    Entities that become visible as centers in a configuration (e.g., rectangles of white space around a dot) that were not present before.
  • SAE latents that rise as correction approaches and peak after self-correction begins, complementing off-topic detectors (OTDs)