Does the model have a feature corresponding to every major world city?

Question explored in the feature completeness study.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Findings (1)

finding

The likelihood that a concept (element, city, animal, food) has a dedicated feature follows a sigmoid in the log-frequency of that concept in the training data, with the threshold frequency inversely proportional to the number of alive features.
answered_by
Quantitative relationship between concept frequency and feature presence.
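
The relationship in this finding can be sketched numerically. The snippet below is a minimal illustration, not the paper's fitted model: the function names, the proportionality constant `c`, and the `slope` parameter are hypothetical placeholders, and no values here are taken from the study. It only encodes the two stated properties: a sigmoid in log-frequency, and a threshold frequency that scales as 1 / (number of alive features).

```python
import math


def threshold_frequency(n_alive: int, c: float = 1.0) -> float:
    """Frequency at which a concept has a 50% chance of a dedicated feature.

    Assumption: threshold = c / n_alive, where c is a hypothetical constant.
    """
    return c / n_alive


def feature_probability(freq: float, n_alive: int,
                        c: float = 1.0, slope: float = 1.0) -> float:
    """Sigmoid in log-frequency, centred on the threshold frequency.

    `slope` (hypothetical) controls how sharply the probability rises.
    """
    f0 = threshold_frequency(n_alive, c)
    z = slope * (math.log(freq) - math.log(f0))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # A concept with a fixed frequency becomes more likely to get its own
    # feature as the number of alive features grows.
    for n_alive in (1_000, 10_000, 100_000):
        p = feature_probability(1e-4, n_alive)
        print(f"alive features = {n_alive:>7}: P(feature) = {p:.3f}")
```

Under these assumptions, doubling the number of alive features halves the frequency a concept needs to reach the 50% point, which is one way to read why rarer cities would be expected to gain features only in larger dictionaries.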

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

World Models · concept · 0.767
Theme issue context: relates to internal models of environment, central to consciousness and cognition across substrates.
we have shown a mathematical relationship between the two models · quote · 0.755
Core claim distinguishing this paper's contribution from looser representational similarity arguments.
Features can be used to steer large models. · claim · 0.746
Clamping feature activations causally alters model behavior in interpretable ways.
Feature universality across independently trained models suggests features have some existence beyond individual models · claim · 0.745
Authors take agnostic position on ontological status but universality evidence pushes toward features being real
Lack of rigorous cross-model comparison demonstrating that specific named features (not just correlated ones) form across architectures · question · 0.744
Explicitly identified research gap: anecdotal evidence exists but rigorous characterization is absent
Models that are competent all represent data in a similar way; all strong models are alike, each weak model is weak in its own way · claim · 0.744
Author's interpretation of the VTAB alignment results echoing Tolstoy
Analogous features and circuits form across models and tasks. · claim · 0.742
Third of three speculative claims asserting that learned features are not model-specific but represent universal solutions to learning problems
Base models are good modellers of worlds but not of their own state, because they lack a developed self-model initially. · claim · 0.737
Observation about asymmetry in base model capabilities.