method
active
method:dictionary-learning-for-neural-network-interpretability

Dictionary Learning for Neural Network Interpretability

Bricken et al.'s method for decomposing language models into interpretable features; cited as AI alignment interpretability work relevant to consciousness detection.

Neighborhood — ranked by edge-count

Concepts (1)

concept

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.