method
active
method:goodfire-sae-api

Goodfire SAE API

API providing access to sparse autoencoder (SAE) features for LLaMA 3.3 70B, used for feature steering in Experiment 2.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Goodfire · institute · 0.817
    AI research company; authors' affiliation; develops tools including EVEE and publishes research on genomic foundation models.
  • SAE features · concept · 0.741
    The individual, supposedly monosemantic directions learned by SAEs; argued here to fragment manifolds into disconnected pieces.
  • API method used to identify latents differentially activated between on-topic and off-topic prompt-response pairs
  • P(feature fires at t+100 | fired at t) minus P(feature fires at t+100 | did not fire at t), used because SAE features are binary unlike probe activations
  • Persistence metric for SAE features: P(fires at t+100 | fired at t) minus P(fires at t+100 | did not fire at t)
  • Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.
  • Token-level analysis of OTD and backtracking latent activations aligned at correction points across episodes
  • Standard interpretability approach that VPD critiques and proposes an alternative to.
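
The persistence metric that appears among the related entries above, P(fires at t+100 | fired at t) minus P(fires at t+100 | did not fire at t), can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `persistence`, the input format (one 0/1 firing flag per token position for a single feature), the configurable `lag` parameter, and the choice to return `None` when either conditional probability is undefined are all assumptions made here.

```python
def persistence(fires, lag=100):
    """Hypothetical helper: difference of two conditional firing rates.

    fires: sequence of 0/1 (or bool) flags, one per token position,
           saying whether a single SAE feature fired at that position.
    lag:   offset between the two positions being compared (100 in the text).
    """
    fired_then = 0    # positions t where the feature fired
    fired_both = 0    # ... and it also fired at t + lag
    quiet_then = 0    # positions t where the feature did not fire
    quiet_fire = 0    # ... but it fired at t + lag

    for t in range(len(fires) - lag):
        later = bool(fires[t + lag])
        if fires[t]:
            fired_then += 1
            fired_both += later
        else:
            quiet_then += 1
            quiet_fire += later

    # Assumption: if one of the conditioning events never occurs,
    # the metric is undefined and None is returned.
    if fired_then == 0 or quiet_then == 0:
        return None

    return fired_both / fired_then - quiet_fire / quiet_then
```

A value near 1 means the feature's state at t strongly predicts its state at t+lag; a value near 0 means firing at t carries no information about firing later.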