artifact
active
artifact:feature-visualization-interface-transformer-circuits-pub-2023-monosemantic-features-vis
Feature Visualization Interface (transformer-circuits.pub/2023/monosemantic-features/vis/)
Interactive interface for exploring the features of all 90 learned dictionaries, including activating examples, logit effects, and ablations
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- Primary method introduced: trains a sparse autoencoder (a one-hidden-layer MLP trained to reconstruct its input) with an L1 sparsity penalty on its hidden activations, decomposing model activations into overcomplete feature dictionaries
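The method summarized above can be illustrated with a small sparse autoencoder in NumPy. This is a minimal sketch under stated assumptions, not the authors' training code: the helper names (`init_sae`, `forward`, `loss_and_grads`, `train_sae`), the subtraction of the decoder bias before encoding, the re-projection of dictionary rows to unit norm, full-batch gradient descent, and every hyperparameter are illustrative choices. Synthetic data built from sparse combinations of random directions stands in for real model activations.

```python
import numpy as np


def init_sae(d_model, n_features, seed=0):
    """Hypothetical initializer: unit-norm dictionary rows, encoder = decoder transpose."""
    rng = np.random.default_rng(seed)
    w_dec = rng.normal(size=(n_features, d_model))
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {
        "W_enc": w_dec.T.copy(),        # (d_model, n_features)
        "b_enc": np.zeros(n_features),  # (n_features,)
        "W_dec": w_dec,                 # (n_features, d_model): one dictionary row per feature
        "b_dec": np.zeros(d_model),     # (d_model,)
    }


def forward(params, x):
    """Encode x (batch, d_model) into nonnegative features f and reconstruct x_hat."""
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    f = np.maximum(pre, 0.0)                       # ReLU hidden layer = feature activations
    x_hat = f @ params["W_dec"] + params["b_dec"]  # linear combination of dictionary rows
    return pre, f, x_hat


def loss_and_grads(params, x, l1_coeff):
    """Loss = mean squared reconstruction error + l1_coeff * mean L1 norm of f."""
    pre, f, x_hat = forward(params, x)
    n = x.shape[0]
    err = x_hat - x
    loss = np.sum(err ** 2) / n + l1_coeff * np.sum(np.abs(f)) / n

    # Manual backpropagation through decoder, L1 penalty, ReLU, and encoder.
    d_xhat = 2.0 * err / n
    d_f = d_xhat @ params["W_dec"].T + l1_coeff * np.sign(f) / n
    d_pre = d_f * (pre > 0)
    grads = {
        "W_dec": f.T @ d_xhat,
        "W_enc": (x - params["b_dec"]).T @ d_pre,
        "b_enc": d_pre.sum(axis=0),
        # b_dec appears twice: added to x_hat and subtracted before encoding.
        "b_dec": d_xhat.sum(axis=0) - (d_pre @ params["W_enc"].T).sum(axis=0),
    }
    return loss, grads


def train_sae(x, n_features, l1_coeff=1e-2, lr=5e-3, steps=500, seed=0):
    """Full-batch gradient descent with dictionary rows re-projected to unit norm."""
    params = init_sae(x.shape[1], n_features, seed)
    history = []
    for _ in range(steps):
        loss, grads = loss_and_grads(params, x, l1_coeff)
        history.append(loss)
        for name, g in grads.items():
            params[name] -= lr * g
        # Without this constraint the L1 penalty could be dodged by shrinking f
        # while scaling up the rows of W_dec.
        params["W_dec"] /= np.linalg.norm(params["W_dec"], axis=1, keepdims=True)
    return params, history


# Usage on synthetic data: 8-dim "activations" built from 16 sparse ground-truth directions,
# decomposed into an overcomplete dictionary of 32 features.
rng = np.random.default_rng(1)
true_dirs = rng.normal(size=(16, 8))
codes = rng.random((512, 16)) * (rng.random((512, 16)) < 0.1)
x = codes @ true_dirs
params, history = train_sae(x, n_features=32)
_, f, _ = forward(params, x)
print(f"loss {history[0]:.3f} -> {history[-1]:.3f}, "
      f"mean active features per example {np.mean(np.sum(f > 0, axis=1)):.1f}")
```

Full-batch gradient descent keeps the sketch short; a practical run would more plausibly use minibatches of real activations and an adaptive optimizer, and `l1_coeff` trades reconstruction quality against how few features fire per example.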