Sequential SAE Activation Analysis

Token-level analysis of OTD and backtracking latent activations aligned at correction points across episodes
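
As a rough illustration of what aligning activations at correction points can look like, the sketch below averages one latent's per-token activation at each offset relative to the correction token. The function name `align_at_correction`, the `(activations, correction_idx)` episode layout, and the `window` parameter are hypothetical and not taken from the analysis itself.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical episode layout: (per-token activations of one latent, index of the correction token).
Episode = Tuple[Sequence[float], int]


def align_at_correction(episodes: List[Episode], window: int = 3) -> Dict[int, float]:
    """Mean activation at each offset from the correction point, across episodes.

    Offsets that fall outside an episode are skipped for that episode,
    so each mean only uses the episodes that cover that offset.
    """
    offsets = range(-window, window + 1)
    sums = {off: 0.0 for off in offsets}
    counts = {off: 0 for off in offsets}
    for activations, correction_idx in episodes:
        for off in offsets:
            pos = correction_idx + off
            if 0 <= pos < len(activations):
                sums[off] += activations[pos]
                counts[off] += 1
    return {off: sums[off] / counts[off] for off in offsets if counts[off] > 0}


if __name__ == "__main__":
    toy = [([0.0, 0.0, 1.0, 2.0], 2), ([0.0, 3.0, 0.0], 1)]
    print(align_at_correction(toy, window=1))
```

Offset 0 is the correction token, negative offsets are the tokens leading up to it, and positive offsets are the tokens that follow.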

Neighborhood — ranked by edge-count

dataset

146 Self-Correction Episodes from Llama-3.3-70B
uses
Dataset of confirmed self-correction episodes used for sequential activation analysis

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse Autoencoders (SAE) activation-based paradigm · framework · 0.787
Standard interpretability approach that VPD critiques and to which it proposes an alternative.
Statistical Activation Analysis · method · 0.757
Component of the contrastive retrieval pipeline analyzing activation statistics.
SAE features trained on text activations generalize to image inputs, activating on relevant visual depictions. · finding · 0.754
Out-of-distribution generalization of SAE features.
Single-Layer SAE Analysis Limitation · concept · 0.741
Key limitation that prevents tracing inter-layer dynamics or how steering propagates through model depth.
Sparse Autoencoders (SAE) · method · 0.738
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
SAE Feature Conditional Firing Persistence Metric · method · 0.733
P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t); used because SAE features are binary, unlike probe activations.
Textual SAE feature emotionality evaluation · method · 0.725
Method where Kimi evaluates steered vs. unsteered text samples from another instance to rate SAE feature emotionality (0-100).
Self-evaluated emotionality and textual evaluation of SAE features predict persistence in opposite directions. · claim · 0.725
Surprising finding that the two evaluation methods diverge in their relationship with persistence.
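
The conditional firing persistence metric listed above can be sketched in a few lines. This is a minimal illustration, assuming a feature's firing is available as one boolean per token position; the function name `firing_persistence` and the choice to return NaN when one of the two conditions never occurs are assumptions, not part of the original method description.

```python
from typing import Sequence


def firing_persistence(fired: Sequence[bool], lag: int = 100) -> float:
    """P(fires at t+lag | fired at t) - P(fires at t+lag | did not fire at t).

    `fired` holds one boolean per token position (assumed layout).
    Returns NaN if the feature always or never fires in the conditioning window,
    since one of the two conditional probabilities is then undefined.
    """
    if lag <= 0 or len(fired) <= lag:
        raise ValueError("lag must be positive and shorter than the sequence")
    # Pair each position t with position t + lag.
    pairs = list(zip(fired[:-lag], fired[lag:]))
    after_fire = [bool(later) for now, later in pairs if now]
    after_quiet = [bool(later) for now, later in pairs if not now]
    if not after_fire or not after_quiet:
        return float("nan")
    p_given_fired = sum(after_fire) / len(after_fire)
    p_given_quiet = sum(after_quiet) / len(after_quiet)
    return p_given_fired - p_given_quiet


if __name__ == "__main__":
    # Toy sequence with a short lag so the example stays readable.
    print(firing_persistence([True, False, True, False, True, False], lag=2))
```

A value near 1 means firing at t strongly predicts firing at t+lag, a value near 0 means firing at t carries no information about t+lag, and negative values mean firing at t makes later firing less likely.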