Feature Density Histogram

Log-scale histogram of feature firing rates, used as a proxy for autoencoder quality during hyperparameter tuning
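The description above can be sketched in code. This is a minimal illustration, assuming the autoencoder's feature activations are available as a NumPy array of shape (n_tokens, n_features). The function names, the clipping floor of 1e-10, the bin count, and the synthetic data are all hypothetical choices made for the example and do not come from the source.

```python
import numpy as np


def feature_densities(activations: np.ndarray) -> np.ndarray:
    """Fraction of tokens on which each feature has a nonzero activation.

    activations: array of shape (n_tokens, n_features).
    """
    return (activations != 0).mean(axis=0)


def log_density_histogram(densities, n_bins=40, floor=1e-10):
    """Histogram of log10(density) across features.

    Features that never fire have density 0, so densities are clipped to
    `floor` (an assumed value) before taking the logarithm.
    """
    log_d = np.log10(np.clip(densities, floor, None))
    return np.histogram(log_d, bins=n_bins)


# Synthetic stand-in for real autoencoder activations (assumption):
# each feature fires independently at a rate drawn log-uniformly
# between 1e-3 and 1e-1.
rng = np.random.default_rng(0)
n_tokens, n_features = 20_000, 64
rates = 10.0 ** rng.uniform(-3, -1, size=n_features)
fires = rng.random((n_tokens, n_features)) < rates
# Shift by 1.0 so every firing activation is strictly nonzero.
activations = fires * (1.0 + rng.random((n_tokens, n_features)))

densities = feature_densities(activations)
counts, edges = log_density_histogram(densities)
```

Here `counts[i]` is the number of features whose log10 density falls between `edges[i]` and `edges[i + 1]`. Binning in log space is what lets modes several orders of magnitude apart, such as clusters near 1e-5 and 1e-7, show up as separate peaks.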

Neighborhood — ranked by edge-count

concept

Ultralow Density Cluster
associated_with
Cluster of autoencoder features with extremely low activation density (~1e-7) that are generally not interpretable and appear to be training artifacts

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Feature Density · concept · 0.863
Fraction of training tokens on which a given feature has nonzero activation; used as a proxy metric for autoencoder quality
High Density Feature Cluster · concept · 0.764
Mode in the feature density histogram around 1e-5, corresponding to interpretable features; contrasted with the ultralow density cluster
Recognition Density · concept · 0.741
Approximate posterior probability distribution embodied in an organism's internal states; the organism's best guess about the causes of its sensations
Geometry of features · concept · 0.731
Research thread within About Blank concerning the structure and relational properties of neural network feature representations; covariance pooling tangentially supports this thread.
Fourier features · concept · 0.728
Features identified in Llama-3.1-8B that compute sums using periods respecting base-10 addition (2, 5, 10) rather than concept-specific periods
Feature Visualization · method · 0.724
Method of optimizing an input to make a neuron fire maximally, used to characterize what the neuron detects; establishes a causal link
Sparse Feature Dictionary · concept · 0.723
The extracted set of sparse interpretable features from model embeddings via SAEs
Linear Representation of Features · concept · 0.722
The central object of study — the idea that a concept like truth is encoded as a direction in the LLM's latent space