claim
active
claim:automated-interpretability-using-llms-can-usefully-score-feature-specificity
Automated interpretability using LLMs can usefully score feature specificity.
Claude 3 Opus ratings aligned with human judgment of feature descriptions.
Source paper
extracted_from
Neighborhood (ranked by edge-count)
Findings (1)
finding
- Validation that top activations are highly specific to the feature's interpretation.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Establishes that the observed linear structure is not merely a representation of text probability
- Using Claude 3 Opus to generate feature explanations and predict held-out activations.
- We hypothesize that LLMs represent correctness of arithmetic expressions differently from factual statements.
  hypothesis · 0.809 · Core working hypothesis motivating the factual vs. arithmetic task split in the experimental design.
- Core cross-modal empirical result: larger and better language models align better with vision models
- Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
- Quantitative comparison supporting SAE utility.
- Interpretive claim connecting scale to abstraction level in LLM representations
- Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
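
The "cosine ≥ 0.65 · no typed edge" criterion for the list above can be sketched as a simple filter. This is only an illustration under stated assumptions: the page does not say how entity embeddings are computed or how edges are stored, so `embeddings`, `typed_edges`, and `untyped_neighbors` are hypothetical names, and the toy vectors are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with cosine >= threshold and no typed
    edge to the target, sorted by descending similarity.

    Assumption: typed_edges is a set of (id, id) pairs, checked in both directions.
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): 2-D vectors standing in for real entity embeddings.
embeddings = {
    "claim": [1.0, 0.0],
    "hypothesis": [0.8, 0.6],  # similar to the claim, no typed edge
    "finding": [1.0, 0.0],     # identical direction, but already has a typed edge
    "unrelated": [0.0, 1.0],   # orthogonal, falls below the threshold
}
typed_edges = {("claim", "finding")}

neighbors = untyped_neighbors("claim", embeddings, typed_edges)
```

Entities returned by such a filter are the ones worth reviewing by hand, either to add a typed relation or to merge as duplicates.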