Feature completeness search using LLM-generated queries

Using Claude to search for features that activate on specific concepts, combined with automated feature labeling.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Automated interpretability using LLMs can usefully score feature specificity. · claim · 0.777
Claude 3 Opus ratings aligned with human judgment of feature descriptions.
LLMs trained only on language data have rich enough knowledge of visual structures that decent visual representations can be trained on images generated solely by querying the LLM · finding · 0.761
Sharma et al. result supporting cross-modal alignment: language-only models implicitly encode visual structure
How can we decide if our selection of examples is complete? · question · 0.748
Central question motivating attribute exploration.
LLMs sometimes know statements are false but generate them anyway, motivating the need for techniques that inspect internal model state rather than outputs alone · claim · 0.746
Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings
Do LLMs have a unified representation of truth that spans structurally and topically diverse data? · question · 0.744
Central research question driving dataset design and experimental approach
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the likely dataset performing poorly on anti-correlated datasets · claim · 0.744
Establishes that the observed linear structure is not merely a representation of text probability
As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs · claim · 0.740
Interpretive claim connecting scale to abstraction level in LLM representations
We hypothesize that LLMs represent correctness of arithmetic expressions differently from factual statements. · hypothesis · 0.739
Core working hypothesis motivating the factual vs. arithmetic task split in the experimental design.
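
The selection rule behind this list (cosine ≥ 0.65, no typed edge) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the entity names, the toy embedding vectors, the `typed_edges` set, and the helper names `cosine` and `related_by_similarity` are hypothetical and do not come from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs that clear the threshold and have no typed edge to target."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already connected to the target by a typed relation.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: three-dimensional embeddings and one existing typed edge.
embeddings = {
    "feature_search": [1.0, 0.0, 0.0],
    "auto_interp": [0.8, 0.6, 0.0],
    "linked_entity": [0.9, 0.1, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}
typed_edges = {frozenset(("feature_search", "linked_entity"))}

neighbors = related_by_similarity("feature_search", embeddings, typed_edges)
print(neighbors)
```

In this toy setup, `linked_entity` is excluded despite its high similarity because it already has a typed edge, and `unrelated` falls below the threshold. Entries that survive both filters are the candidates for new edges or unrecognized duplicates.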