Concreteness Filtering of SAE Latents

Pre-filtering step that excludes abstract latents, for which off-topic detection is harder
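A minimal sketch of how such a pre-filter might look, assuming each latent already carries a concreteness rating on the 0-100 scale used by the Concreteness Judge entry on this page. The `Latent` class, the `filter_concrete` helper, the example labels and scores, and the cutoff of 50 are all hypothetical and chosen only for illustration; the source does not state the threshold actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Latent:
    """Hypothetical record for one SAE latent and its judge rating."""
    index: int
    label: str
    concreteness: int  # judge rating on a 0-100 scale


def filter_concrete(latents, threshold=50):
    """Keep latents rated at or above `threshold` (assumed cutoff)."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must lie in [0, 100]")
    return [lat for lat in latents if lat.concreteness >= threshold]


# Illustrative data only: labels and scores are made up.
latents = [
    Latent(0, "references to bicycles", 92),
    Latent(1, "abstract notions of fairness", 18),
    Latent(2, "names of citrus fruits", 88),
]

kept = filter_concrete(latents, threshold=50)
kept_indices = [lat.index for lat in kept]
print(kept_indices)
```

With these made-up ratings, the abstract latent (index 1) is dropped and the two concrete ones remain as steering candidates.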

Neighborhood — ranked by edge-count

concept

Endogenous Steering Resistance
supports
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Relevance Filtering of SAE Latents · concept · 0.846
Pre-filtering step excluding latents naturally activated by each prompt to ensure genuine off-topic steering
SAE Latents · concept · 0.790
Interpretable features extracted by sparse autoencoders used as steering targets in this study
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement · claim · 0.741
Claim that feature grounding enables interpretability metrics.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation · claim · 0.738
Extension of mechanistic interpretability findings to the metacognitive domain
Sparse Autoencoders (SAE) · method · 0.736
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions · claim · 0.723
Surprising finding that the two evaluation methods diverge in their relationship with persistence
Concreteness Judge · method · 0.722
LLM-based judge rating SAE latent labels 0-100 for concreteness to filter steering candidates
Feature attribution via gradient dot product with SAE decoder · method · 0.718
Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
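
The attribution rule in the last entry can be written as attribution_i = a_i * (g · d_i), where g is the gradient of the output logit, d_i is the decoder weight vector of feature i, and a_i is that feature's activation. A small sketch in plain Python follows; the function names and the numeric values are hypothetical, and the code assumes g and every d_i live in the same activation space.

```python
def dot(u, v):
    """Dot product of two equal-length sequences of floats."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def feature_attributions(logit_grad, decoder_rows, activations):
    """Return a_i * (g . d_i) for every feature i.

    logit_grad   -- gradient of the output logit (vector g)
    decoder_rows -- one decoder weight vector d_i per feature
    activations  -- one activation a_i per feature
    """
    if len(decoder_rows) != len(activations):
        raise ValueError("need one activation per decoder row")
    return [
        act * dot(logit_grad, row)
        for row, act in zip(decoder_rows, activations)
    ]


# Illustrative values only.
logit_grad = [0.5, -1.0, 2.0]
decoder_rows = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [2.0, 2.0, 0.0],
]
activations = [2.0, 0.0, 1.5]

attributions = feature_attributions(logit_grad, decoder_rows, activations)
print(attributions)
```

A feature with zero activation receives zero attribution regardless of how well its decoder direction aligns with the gradient, which is why the middle feature contributes nothing here.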