claim

active

claim:saes-can-surface-features-relevant-to-meta-cognitive-monitoring-not-just-object-level-content-representation

SAEs can surface features relevant to meta-cognitive monitoring, not just object-level content representation

Extension of mechanistic interpretability findings to the metacognitive domain

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

SAE features can be grounded in clinical taxonomy (abnormality, age, sex, medication) to benchmark monosemanticity and entanglement. · claim · 0.851
Claim that feature grounding enables interpretability metrics.
SAE features trained on text activations generalize to image inputs, activating on relevant visual depictions. · finding · 0.817
Out-of-distribution generalization of SAE features.
SAE features generalize to images despite training only on text, indicating out-of-distribution robustness. · claim · 0.809
A promising property for interpretability analysis off-distribution.
Our SAEs' features are more interpretable than neurons. · claim · 0.809
Automated interpretability and specificity ratings show SAE features are clearer than MLP neurons.
SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure. · claim · 0.800
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.797
Surprising finding that the two evaluation methods diverge in their relationship with persistence.
Larger SAEs contain features for concepts not captured in smaller SAEs, indicating improved coverage. · claim · 0.797
Scaling SAE size increases granularity and discovers new features.
Agentic self-evaluation and self-steering may scale to broadly interpret and understand internal representations and SAE features. · claim · 0.795
Forward-looking claim about the potential of model introspection as an interpretability tool.
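
The "cosine ≥ 0.65 · no typed edge" filter that defines this list can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed relations are stored as pairs of entity IDs; the names `related_by_similarity`, `embeddings`, and `typed_edges` are hypothetical and are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to the target but lacking a typed edge.

    embeddings:  dict mapping entity ID -> vector (hypothetical storage layout)
    typed_edges: set of (id_a, id_b) pairs that already have a typed relation
    threshold:   minimum cosine similarity; 0.65 mirrors the cutoff shown on this page
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue
        # Skip entities already linked to the target, in either direction.
        if (target_id, entity_id) in typed_edges or (entity_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    # Highest similarity first, matching the descending scores in the list above.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates
```

Entities returned by such a filter are the candidates for new edges or duplicate detection described above; the threshold trades recall against noise and is treated here as a tunable parameter.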