Relevance Filtering of SAE Latents

Pre-filtering step that excludes latents already activated by a given prompt, so that any latent chosen for steering is genuinely off-topic for that prompt
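A minimal sketch of how such a relevance filter might look, assuming per-token SAE activations on the unsteered prompt are available as plain lists. The function name, the activation layout, and the threshold are illustrative assumptions; the source does not specify how "naturally activated" is measured.

```python
from typing import Dict, List, Sequence


def filter_irrelevant_latents(
    prompt_activations: Dict[int, Sequence[float]],
    candidate_latents: Sequence[int],
    threshold: float = 0.0,
) -> List[int]:
    """Keep candidates whose peak activation on the prompt is <= threshold.

    prompt_activations maps a latent id to its per-token activations on the
    unsteered prompt (hypothetical layout). Latents missing from the mapping
    are treated as inactive. The threshold is an assumed cutoff.
    """
    kept = []
    for latent_id in candidate_latents:
        acts = prompt_activations.get(latent_id, ())
        peak = max(acts, default=0.0)
        if peak <= threshold:
            # The prompt does not activate this latent, so steering on it
            # would be off-topic for this prompt.
            kept.append(latent_id)
    return kept


# Toy activations for one prompt (made-up values, for illustration only).
example_acts = {
    101: [0.0, 0.0, 0.0],
    202: [0.0, 1.7, 0.3],
    303: [0.0, 0.0, 0.05],
}
kept = filter_irrelevant_latents(example_acts, [101, 202, 303, 404], threshold=0.1)
```

In this toy case latent 202 is dropped because the prompt activates it, while 404 is kept because it has no recorded activation. The filter would be run once per prompt, since a latent that is off-topic for one prompt may be on-topic for another.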

Neighborhood — ranked by edge-count

concept

SAE Latents
related_to
Interpretable features extracted by sparse autoencoders used as steering targets in this study
Endogenous Steering Resistance
supports
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Concreteness Filtering of SAE Latents · concept · 0.846
Pre-filtering step excluding abstract latents where off-topic detection is harder
Sparse Autoencoders (SAE) · method · 0.762
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation · claim · 0.753
Extension of mechanistic interpretability findings to the metacognitive domain
SAE features · concept · 0.746
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
Textual SAE feature emotionality evaluation · method · 0.745
Method where Kimi evaluates steered vs unsteered text samples from another instance to rate SAE feature emotionality (0-100)
Sparse Autoencoders (SAE) activation-based paradigm · framework · 0.738
Standard interpretability approach that VPD critiques and proposes an alternative to.
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.738
Surprising finding that the two evaluation methods diverge in their relationship with persistence
Single-Layer SAE Analysis Limitation · concept · 0.736
Key limitation that prevents tracing inter-layer dynamics or how steering propagates through model depth
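
The "cosine ≥ 0.65 · no typed edge" listing above can be sketched as a simple query over entity embeddings. This is an illustration only: the embedding vectors, the edge representation, and the function names are assumptions, not the implementation behind this page.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(
    target: str,
    embeddings: Dict[str, Sequence[float]],
    typed_edges: Set[Tuple[str, str]],
    min_cosine: float = 0.65,
) -> List[Tuple[str, float]]:
    """List entities similar to `target` that share no typed edge with it."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked by a typed relation in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= min_cosine:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy 2-d embeddings (made-up values, for illustration only).
toy_embeddings = {"A": [1.0, 0.0], "B": [1.0, 0.1], "C": [0.0, 1.0], "D": [1.0, 1.0]}
toy_edges = {("A", "B")}
candidates = untyped_neighbors("A", toy_embeddings, toy_edges)
```

Here "B" is excluded because it already has a typed edge to "A", "C" falls below the similarity cutoff, and only "D" is returned as a candidate for a new edge or a possible duplicate.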