Automated interpretability pipeline using LLMs

Using Claude 3 Opus to generate feature explanations and predict held-out activations.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Automated interpretability using LLMs can usefully score feature specificity. · claim · 0.817
Claude 3 Opus ratings aligned with human judgment of feature descriptions.
Automated Interpretability · framework · 0.813
Method using large language models (Claude) to generate and test explanations of features at scale
interpretability · concept · 0.776
The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets · claim · 0.745
Establishes that the observed linear structure is not merely a representation of text probability
Application Programming Interface Access to LLMs · concept · 0.737
Relatively unconstrained API access to powerful LLMs that vastly expands range of possible dialogue agent actions and risks
Interpretability as Natural Science · framework · 0.732
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
interpretative method · method · 0.731
The historical/hermeneutic approach adopted by the paper to analyze cybernetic diagrams in light of Flusser’s philosophy.
Neural Network Interpretability · concept · 0.730
The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
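
The selection rule stated at the top of this list (cosine ≥ 0.65, no typed edge) can be sketched as a simple filter. This is a minimal illustration only: the function names, the dictionary of embeddings, and the set of typed edges are hypothetical, and the page does not say how embeddings are computed or how edges are stored.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities near `target` that have no typed edge to it.

    embeddings:  hypothetical dict mapping entity name -> embedding vector
    typed_edges: hypothetical set of (entity, entity) pairs that already have a typed relation
    threshold:   0.65 is the cutoff shown on this page
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target by a typed edge, in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score may indicate a missing edge or a duplicate, or just shared vocabulary, as with the hermeneutic "interpretative method" entry above.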