method
active
method:sequential-sae-activation-analysis
Sequential SAE Activation Analysis
Token-level analysis of OTD and backtracking latent activations aligned at correction points across episodes
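As a rough illustration of what aligning activations at correction points could look like, the sketch below averages one latent's per-token activation at each offset relative to the correction token, across episodes. The function name `align_at_correction`, the `window` parameter, and the input layout (one list of activations per episode plus one correction index per episode) are assumptions made for this example and are not taken from the method itself.

```python
from typing import Dict, List, Optional, Sequence


def align_at_correction(
    episodes: Sequence[Sequence[float]],
    correction_idx: Sequence[int],
    window: int = 2,
) -> Dict[int, Optional[float]]:
    """Mean activation at each token offset relative to the correction point.

    episodes[i] is a hypothetical per-token activation trace for one latent,
    and correction_idx[i] is the token index of the correction in episode i.
    Offsets that fall outside an episode are skipped for that episode.
    """
    offsets: List[int] = list(range(-window, window + 1))
    sums = {off: 0.0 for off in offsets}
    counts = {off: 0 for off in offsets}

    for acts, c in zip(episodes, correction_idx):
        for off in offsets:
            t = c + off
            if 0 <= t < len(acts):
                sums[off] += acts[t]
                counts[off] += 1

    # None marks offsets that no episode covered.
    return {
        off: (sums[off] / counts[off] if counts[off] else None)
        for off in offsets
    }


if __name__ == "__main__":
    traces = [[0.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
    print(align_at_correction(traces, correction_idx=[2, 1], window=1))
```

Offset 0 is the correction token itself, so negative offsets describe the lead-up to a correction and positive offsets describe what follows it.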
Neighborhood — ranked by edge-count
Datasets (1)
dataset
- Dataset of confirmed self-correction episodes used for sequential activation analysis
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Standard interpretability approach that VPD critiques and proposes an alternative to.
- Component of the contrastive retrieval pipeline analyzing activation statistics.
- Out-of-distribution generalization of SAE features.
- Key limitation that prevents tracing inter-layer dynamics or how steering propagates through model depth
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t), used because SAE features are binary unlike probe activations
- Method where Kimi evaluates steered vs unsteered text samples from another instance to rate SAE feature emotionality (0-100)
- Surprising finding that the two evaluation methods diverge in their relationship with persistence
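
The persistence-difference entry in the list above, P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t), can be estimated from a binary firing trace. The following is a minimal sketch under assumptions: the function name `persistence_difference`, the boolean-sequence input, and the choice to return `None` when either condition has no samples are illustrative and do not come from the source entity.

```python
from typing import Optional, Sequence


def persistence_difference(fired: Sequence[bool], lag: int = 100) -> Optional[float]:
    """Estimate P(fires at t+lag | fired at t) - P(fires at t+lag | not fired at t).

    fired[t] is True when the (binary) SAE feature is active at token t.
    Returns None if either conditioning set is empty, since the
    corresponding conditional probability is then undefined.
    """
    on_total = on_later = 0
    off_total = off_later = 0

    # Only positions with a valid t + lag partner are counted.
    for t in range(len(fired) - lag):
        later = 1 if fired[t + lag] else 0
        if fired[t]:
            on_total += 1
            on_later += later
        else:
            off_total += 1
            off_later += later

    if on_total == 0 or off_total == 0:
        return None
    return on_later / on_total - off_later / off_total


if __name__ == "__main__":
    trace = [True, False, True, False, True, False]
    print(persistence_difference(trace, lag=2))
```

A value near 0 means firing at t carries little information about firing `lag` tokens later; positive values indicate persistence. The default `lag=100` matches the offset named in the entry, and a smaller lag is used in the usage line only to keep the toy trace short.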