claim
active
claim:learned-features-reflect-the-functionality-of-the-model-and-not-just-the-data-distribution-as-evidenced-by-interpretable-downstream-effects-not-used-in-dictionary-learning

Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning

The authors argue that features are properties of the model, not only of the data, because their logit effects and ablation results are consistent with the feature interpretations.

Source paper

extracted_from
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
