method
active
method:automated-interpretability-pipeline-using-llms
Automated interpretability pipeline using LLMs
Uses Claude 3 Opus to generate feature explanations and to predict held-out activations.
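The explain-then-predict loop described above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual implementation. The names `score_feature`, `explain`, and `predict` are hypothetical, the two LLM calls are replaced by toy keyword stubs, and scoring by Pearson correlation between predicted and actual held-out activations is an assumption about how agreement might be measured.

```python
import math
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, float]  # (text, observed feature activation)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def score_feature(
    train: List[Example],
    heldout: List[Example],
    explain: Callable[[List[Example]], str],
    predict: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Generate an explanation from `train`, then score it on `heldout`."""
    explanation = explain(train)  # in a real pipeline: an LLM call
    predicted = [predict(explanation, text) for text, _ in heldout]  # LLM call
    actual = [activation for _, activation in heldout]
    return explanation, pearson(predicted, actual)


# Toy stand-ins for the LLM calls (assumptions for illustration only).
def stub_explain(examples: List[Example]) -> str:
    # Pretend the explanation is the last word of the most active example.
    top_text, _ = max(examples, key=lambda ex: ex[1])
    return top_text.split()[-1].lower()


def stub_predict(explanation: str, text: str) -> float:
    # Predict full activation when the explanation keyword appears in the text.
    return 1.0 if explanation in text.lower() else 0.0


train = [("the Golden Gate Bridge", 0.9), ("a cat", 0.0)]
heldout = [
    ("bridge over the bay", 0.8),
    ("a dog", 0.0),
    ("suspension bridge", 0.7),
    ("rain", 0.1),
]
explanation, score = score_feature(train, heldout, stub_explain, stub_predict)
print(explanation, round(score, 3))
```

Keeping the held-out examples separate from those shown to the explainer means the score reflects how well the explanation generalizes rather than how well it memorizes the examples it was written from.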
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Claude 3 Opus ratings aligned with human judgment of feature descriptions.
- Method using large language models (Claude) to generate and test explanations of features at scale
- The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
- Establishes that the observed linear structure is not merely a representation of text probability
- Relatively unconstrained API access to powerful LLMs that vastly expands the range of possible dialogue agent actions and risks
- Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
- The historical/hermeneutic approach adopted by the paper to analyze cybernetic diagrams in light of Flusser’s philosophy.
- The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper