Single-Layer SAE Analysis Limitation

Key limitation: restricting SAE analysis to a single layer prevents tracing inter-layer dynamics or how steering propagates through model depth

Neighborhood — ranked by edge-count

question

What is the full computational pathway underlying self-correction across multiple layers?
associated_with
Mechanistic question requiring multi-layer SAE analysis beyond current single-layer approach

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse Autoencoders (SAE) · method · 0.755
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.750
Claim that feature grounding enables interpretability metrics.
Scaling laws analysis for SAE hyperparameters · method · 0.747
Sweeping number of features and training steps to find compute-optimal SAE configurations.
SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation · claim · 0.745
Extension of mechanistic interpretability findings to the metacognitive domain
Sequential SAE Activation Analysis · method · 0.741
Token-level analysis of OTD and backtracking latent activations aligned at correction points across episodes
Relevance Filtering of SAE Latents · concept · 0.736
Pre-filtering step excluding latents naturally activated by each prompt to ensure genuine off-topic steering
SAE features · concept · 0.731
The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
TopK Sparse Autoencoders (SAEs) · method · 0.730
Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.
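
The "cosine ≥ 0.65 · no typed edge" listing above can be read as a simple filter: keep entities whose embedding is close to this one, drop any that already share a typed relation, and rank the rest by similarity. The following is a minimal sketch of that idea only. The function names, the toy 2-D embeddings, and the edge representation are all hypothetical and are not taken from the system that produced this page; only the 0.65 threshold comes from the listing.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it.

    `embeddings` maps entity name -> vector; `typed_edges` is a set of
    (source, destination) pairs. Both structures are assumptions made
    for this illustration.
    """
    # Entities already linked to the target in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the order of the listing.
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Toy data: hypothetical 2-D embeddings, not real model outputs.
embeddings = {
    "limitation": [1.0, 0.0],
    "question": [1.0, 0.1],    # similar, but already has a typed edge
    "sae_method": [0.8, 0.6],  # similar, no typed edge
    "unrelated": [0.0, 1.0],   # below the threshold
}
typed_edges = {("question", "limitation")}
result = untyped_neighbors("limitation", embeddings, typed_edges)
```

In this toy setup only `sae_method` survives the filter: `question` is excluded because it already has a typed edge to the target, and `unrelated` falls below the threshold. Entities returned this way are the ones the page describes as candidates for new edges or unrecognized duplicates.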