concept
active
concept:single-layer-sae-analysis-limitation
Single-Layer SAE Analysis Limitation
Key limitation: analyzing SAE features at a single layer cannot trace inter-layer dynamics or show how steering propagates through model depth.
Neighborhood — ranked by edge-count
Questions (1)
question
- What is the full computational pathway underlying self-correction across multiple layers?
  associated_with: Mechanistic question requiring multi-layer SAE analysis beyond the current single-layer approach
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Claim that feature grounding enables interpretability metrics.
- Sweeping number of features and training steps to find compute-optimal SAE configurations.
- Extension of mechanistic interpretability findings to the metacognitive domain
- Token-level analysis of OTD and backtracking latent activations aligned at correction points across episodes
- Pre-filtering step excluding latents naturally activated by each prompt to ensure genuine off-topic steering
- The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
- Sparse dictionary learning method used to extract interpretable features from EEG transformer embeddings.