question
active
question:to-what-extent-do-interpretable-features-represent-the-full-story-of-the-mlp-layer

To what extent do interpretable features represent the 'full story' of the MLP layer?

Question about the completeness of feature-based model explanations

Source paper

extracted_from
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Findings (2)

finding

Concepts (1)

concept
  • Metric measuring the fraction of the MLP's loss contribution that the autoencoder explains, computed by replacing the MLP activations with the autoencoder's outputs
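
The concept above can be made concrete with a small sketch. It assumes three next-token losses are already measured: the unmodified model, the model with MLP activations replaced by the autoencoder's reconstruction, and the model with the MLP output zero-ablated. The zero-ablation baseline, the function name, and the argument names are illustrative assumptions; this entry does not specify them.

```python
def fraction_of_loss_explained(clean_loss: float,
                               autoencoder_loss: float,
                               ablated_loss: float) -> float:
    """Share of the MLP's loss contribution recovered by the autoencoder.

    clean_loss       -- loss of the unmodified model
    autoencoder_loss -- loss with MLP activations replaced by autoencoder outputs
    ablated_loss     -- loss with the MLP removed (assumed here: zero ablation)
    """
    # Loss attributable to the MLP: how much worse the model gets without it.
    mlp_contribution = ablated_loss - clean_loss
    if mlp_contribution <= 0:
        raise ValueError("ablating the MLP must increase the loss")
    # Portion of that contribution that survives the autoencoder swap.
    return (ablated_loss - autoencoder_loss) / mlp_contribution


# Hypothetical numbers, for illustration only.
print(fraction_of_loss_explained(3.0, 3.5, 5.0))  # 0.75
```

Under these assumptions a value near 1 would suggest the features carry most of what the MLP contributes to the loss, while a value well below 1 would point to a residual that the features do not account for.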
