concept
active
concept:neural-network-interpretability

Neural Network Interpretability

The field that aims to understand what neural networks have learned; characterized as pre-paradigmatic in this paper.

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Chris Olah
    studies
    Co-author; provided high-level research guidance and wrote the introduction and discussion.

Frameworks (1)

framework

Concepts (4)

concept
  • VPD achieves sparse, interpretable parameter subcomponents with improved sparsity-reconstruction tradeoff.
  • Causal abstraction
    associated_with
    A framework the paper uses alongside feature geometry to deepen mechanistic understanding of language models (LMs).
  • Knowledge Localization
    associated_with
    Technique that uses interventions to identify which neural network layers store specific knowledge.
  • A metaphor for a qualitative shift in scientific inquiry toward finer-grained detail, analogous to the microscope's role in cellular biology.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.