method
active
method:feature-density-histogram
Feature Density Histogram
Log-scale histogram of feature firing rates, used as a proxy for autoencoder quality during hyperparameter tuning
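As a rough illustration of the method, the sketch below computes each feature's density (taken here to be the fraction of tokens on which its activation is nonzero) and buckets the log10 densities into a histogram. The function names, the nested-list input layout, and the `bin_width` parameter are assumptions made for this example and are not taken from any particular codebase.

```python
import math
from collections import Counter


def feature_densities(activations):
    """Fraction of tokens on which each feature has a nonzero activation.

    `activations` is a hypothetical layout: one inner list per token,
    holding one activation value per autoencoder feature.
    """
    n_tokens = len(activations)
    n_features = len(activations[0])
    counts = [0] * n_features
    for token_acts in activations:
        for j, value in enumerate(token_acts):
            if value != 0:
                counts[j] += 1
    return [c / n_tokens for c in counts]


def log_density_histogram(densities, bin_width=0.5):
    """Bucket log10(density) into bins of width `bin_width`.

    Features that never fire have no finite log density, so they are
    counted separately as "dead" instead of being placed in a bin.
    Each bin is keyed by its lower edge in log10 units.
    """
    hist = Counter()
    dead = 0
    for d in densities:
        if d == 0:
            dead += 1
            continue
        lower_edge = math.floor(math.log10(d) / bin_width) * bin_width
        hist[lower_edge] += 1
    return dict(sorted(hist.items())), dead


# Toy input: 4 tokens, 3 features.
toy_activations = [
    [0.9, 0.0, 0.0],
    [0.4, 0.0, 0.0],
    [1.2, 0.7, 0.0],
    [0.3, 0.0, 0.0],
]
toy_densities = feature_densities(toy_activations)
toy_hist, toy_dead = log_density_histogram(toy_densities)
```

On real data the histogram would be built over many tokens, so that densities as low as the ~1e-7 of the ultralow density cluster can be resolved at all. Distinct modes would then show up as separate peaks along the log10 axis.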
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Ultralow Density Cluster · associated_with · Cluster of autoencoder features with extremely low activation density (~1e-7) that are generally not interpretable and appear to be training artifacts
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Fraction of training tokens on which a given feature has nonzero activation; used as proxy metric for autoencoder quality
- Mode in feature density histogram around 1e-5 corresponding to interpretable features, contrasted with ultralow density cluster
- Approximate posterior probability distribution embodied in organism's internal states; organism's best guess about causes of sensations
- Research thread within About Blank concerning the structure and relational properties of neural network feature representations; covariance pooling tangentially supports this thread.
- Features identified in Llama-3.1-8B that compute sums using periods respecting base-10 addition (2, 5, 10) rather than concept-specific periods
- Method of optimizing input to cause a neuron to fire maximally, used to characterize what a neuron detects; establishes causal link
- The extracted set of sparse interpretable features from model embeddings via SAEs
- The central object of study — the idea that a concept like truth is encoded as a direction in the LLM's latent space