method
active
method:feature-density-histogram

Feature Density Histogram

Log-scale histogram of feature firing rates, used as a proxy for autoencoder quality during hyperparameter tuning
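The description above can be illustrated with a minimal sketch. It assumes the autoencoder's activations are available as a plain tokens-by-features list of lists. The function names `feature_densities` and `log_density_histogram`, and the `bins_per_decade` and `floor` parameters, are hypothetical and not taken from the source.

```python
import math
from collections import Counter


def feature_densities(activations):
    """Fraction of tokens on which each feature has a nonzero activation.

    `activations` is assumed to be a list of token rows, each row holding
    one activation value per feature.
    """
    n_tokens = len(activations)
    n_features = len(activations[0])
    counts = [0] * n_features
    for row in activations:
        for j, value in enumerate(row):
            if value != 0:
                counts[j] += 1
    return [c / n_tokens for c in counts]


def log_density_histogram(densities, bins_per_decade=2, floor=1e-10):
    """Count features per log10(density) bin.

    Features that never fire have density 0, whose log is undefined, so they
    are clamped to `floor` (an assumption made for this sketch). Each key is
    the lower edge of a bin, in units of log10(density).
    """
    hist = Counter()
    for d in densities:
        log_d = math.log10(max(d, floor))
        lower_edge = math.floor(log_d * bins_per_decade) / bins_per_decade
        hist[lower_edge] += 1
    return dict(sorted(hist.items()))


# Toy data: 4 tokens, 3 features (the third feature never fires).
toy_activations = [
    [1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.3, 0.0, 0.0],
]
densities = feature_densities(toy_activations)
histogram = log_density_histogram(densities)
```

On real data, the histogram would be read for its shape rather than for any single bin: the neighboring entries on this page describe a mode around 1e-5 associated with interpretable features and a separate ultralow-density cluster near 1e-7.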

Neighborhood — ranked by edge-count

Concepts (1)

concept
  • Cluster of autoencoder features with extremely low activation density (~1e-7) that are generally not interpretable and appear to be training artifacts

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Feature Density · concept · 0.863
    Fraction of training tokens on which a given feature has nonzero activation; used as a proxy metric for autoencoder quality
  • Mode in feature density histogram around 1e-5 corresponding to interpretable features, contrasted with ultralow density cluster
  • Approximate posterior probability distribution embodied in organism's internal states; organism's best guess about causes of sensations
  • Research thread within About Blank concerning the structure and relational properties of neural network feature representations; covariance pooling tangentially supports this thread.
  • Fourier features · concept · 0.728
    Features identified in Llama-3.1-8B that compute sums using periods respecting base-10 addition (2, 5, 10) rather than concept-specific periods
  • Method of optimizing input to cause a neuron to fire maximally, used to characterize what a neuron detects; establishes causal link
  • The extracted set of sparse interpretable features from model embeddings via SAEs
  • The central object of study — the idea that a concept like truth is encoded as a direction in the LLM's latent space