method
active
method:goodfire-sae-api
Goodfire SAE API
API providing access to sparse autoencoder (SAE) features for LLaMA 3.3 70B; used for feature steering in Experiment 2.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- AI research company; authors' affiliation; develops tools including EVEE and publishes research on genomic foundation models.
- The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
- API method used to identify latents differentially activated between on-topic and off-topic prompt-response pairs
- P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t), used because SAE features are binary, unlike probe activations
- Persistence metric for SAE features: P(fires at t+100 | fired at t) minus P(fires at t+100 | did not fire at t)
- Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
- Token-level analysis of OTD and backtracking latent activations aligned at correction points across episodes
- Standard interpretability approach that VPD critiques and proposes an alternative to.
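
The persistence metric that appears twice in the list above, P(fires at t+100 | fired at t) minus P(fires at t+100 | did not fire at t), can be illustrated with a minimal sketch. It assumes a feature's activations have already been binarized into a 0/1 sequence over token positions; the function name `persistence` and its `lag` parameter are hypothetical and are not part of the Goodfire SAE API.

```python
def persistence(fired, lag=100):
    """Estimate P(fires at t+lag | fired at t) - P(fires at t+lag | did not fire at t).

    `fired` is a sequence of 0/1 (or bool) values, one per token position.
    Returns None when either conditional probability is undefined, i.e. the
    feature fired at every usable position or at none of them.
    """
    if lag <= 0 or len(fired) <= lag:
        raise ValueError("need a positive lag shorter than the sequence")

    on_total = on_later = 0    # positions where the feature fired at t
    off_total = off_later = 0  # positions where it did not fire at t

    # Only positions with a valid t + lag partner are counted.
    for t in range(len(fired) - lag):
        later = 1 if fired[t + lag] else 0
        if fired[t]:
            on_total += 1
            on_later += later
        else:
            off_total += 1
            off_later += later

    if on_total == 0 or off_total == 0:
        return None
    return on_later / on_total - off_later / off_total
```

A value near 1 means the feature, once active, tends to still be active `lag` tokens later while rarely appearing otherwise; a value near 0 means firing at t carries little information about firing at t + `lag`.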