claim

active

claim:automated-interpretability-using-llms-can-usefully-score-feature-specificity

Automated interpretability using LLMs can usefully score feature specificity.

Claude 3 Opus ratings aligned with human judgment of feature descriptions.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Findings (1)

finding

For four example features (Golden Gate Bridge, brain sciences, monuments, transit infrastructure), all strong activations (top bucket) received a specificity rating of 3 from Claude 3 Opus.
supports
Validates that top activations are highly specific to each feature's interpretation.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets · claim · 0.820
Establishes that the observed linear structure is not merely a representation of text probability
Automated interpretability pipeline using LLMs · method · 0.817
Using Claude 3 Opus to generate feature explanations and predict held-out activations.
We hypothesize that LLMs represent correctness of arithmetic expressions differently from factual statements. · hypothesis · 0.809
Core working hypothesis motivating the factual vs. arithmetic task split in the experimental design.
The better an LLM is at language modeling, the more it aligns with vision models, and vice versa: a linear relationship between language modeling score and vision-language alignment · finding · 0.798
Core cross-modal empirical result: larger and better language models align better with vision models
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.797
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons. · finding · 0.797
Quantitative comparison supporting SAE utility.
As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs · claim · 0.797
Interpretive claim connecting scale to abstraction level in LLM representations
In intermediate regimes of scale or layer depth, LLMs may linearly represent features at intermediate levels of abstraction such as 'accurate factual recall' or 'close association' rather than abstract truth · claim · 0.792
Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations