SoLU Activation Function

Prior Anthropic approach to increasing neuron monosemanticity through activation function design; found to make some neurons more interpretable at the cost of others
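For orientation, SoLU (Softmax Linear Unit) is usually written as SoLU(x) = x * softmax(x), with the softmax taken across the neurons of one MLP layer. The sketch below is a minimal pure-Python illustration of that formula only; the function names and input values are hypothetical, and it leaves out the extra LayerNorm that the published setup applies after the activation.

```python
import math


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def solu(xs):
    # SoLU(x) = x * softmax(x), applied across one layer's pre-activations.
    return [x * p for x, p in zip(xs, softmax(xs))]


# Hypothetical pre-activations for a four-neuron layer.
pre_activations = [4.0, 1.0, 0.5, -2.0]
post = solu(pre_activations)
```

The largest pre-activation keeps most of its value while the others are pushed toward zero, which is the sparsifying effect the design relies on.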

Neighborhood — ranked by edge-count

Claims (1)

claim

Training models with sparse activations cannot fully prevent polysemanticity because cross-entropy loss creates incentives for polysemantic neurons even without superposition
extends
Author's conclusion after extensive investigation of architectural approaches to monosemanticity

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

GeLU Activation Function · concept · 0.804
The nonlinear activation function used in MLP layers; prevents the linearization approach used for attention layers from extending to MLP layers
Activations · concept · 0.727
Internal representations of the model on which probes operate; the method uses activations to rank datapoints.
Sequential SAE Activation Analysis · method · 0.694
Token-level analysis of OTD and backtracking latent activations aligned at correction points across episodes
Activation Addition · method · 0.692
Intervention method that adds a learned direction vector to residual stream activations to steer model behavior
Statistical Activation Analysis · method · 0.690
Component of the contrastive retrieval pipeline analyzing activation statistics.
Activation Probing · concept · 0.687
Technique of reading out model beliefs from internal activations before the final answer token is generated
Other-Referencing Activations · concept · 0.686
Latent model activations when processing inputs framed from another agent's perspective
Activation decomposition · concept · 0.686
The conventional approach (e.g., SAEs, transcoders) of decomposing activations into interpretable features.