claim

active

claim:the-resulting-features-are-highly-abstract-multilingual-multimodal-and-generalizing-between-concrete-and-abstract-references

The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.

Features respond to concepts across languages and in images, not just text.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Neighborhood — ranked by edge-count

Findings (2)

finding

Golden Gate Bridge feature [34M/31164353] fires strongly on Wikipedia snippets in Chinese, Japanese, Korean, Russian, Vietnamese, and Greek.
supports
Demonstrates multilingual generalization of SAE features.
SAE features trained on text activations generalize to image inputs, activating on relevant visual depictions.
supports
Out-of-distribution generalization of SAE features.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

An interplay between causal abstraction and feature geometry deepens mechanistic understanding of language models · claim · 0.794
Methodological claim about the scientific value of combining causal abstraction with representational geometry analysis
The features are often organized in geometrically-related clusters that share a semantic relationship. · claim · 0.775
Decoder cosine similarity maps onto concept similarity.
Four features will be fundamental: (1) all processes rethought as morphogenetic; (2) sequences link into a net; (3) continuous evolution becomes widespread; (4) an ethical obligation to heal the land emerges. · claim · 0.775
Vision of the emerging paradigm shift in society.
How do representations differ or converge between architectures, tasks, and modalities? · question · 0.774
Broader research question MAS is positioned to address, citing multiple recent works.
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.770
Motivation for using sparsity-based dictionary learning on language models
Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection · finding · 0.770
Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
What is the precise characterization of the perceptual representational format in which information exists as a specific, sustained, globally constitutive structure? · question · 0.769
Open theoretical problem CIMC acknowledges: precisely characterizing the representational format of perception
Abstract performatives include purely internal actions such as commitments not necessarily expressed in output, but on whose fulfillment the correctness of the program depends. · quote · 0.768
Definition of abstract performative, a core invention of the paper.
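
The "Related by similarity" listing above keeps entities whose cosine similarity to this claim is at least 0.65 and that have no typed edge to it. How the page computes this is not stated in the source, so the following is only a minimal sketch under assumptions: each entity is assumed to have an embedding vector, and typed edges are assumed to be stored as unordered pairs of entity ids. The names `cosine`, `related_by_similarity`, `embeddings`, and `typed_edges` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, highest score first.

    embeddings:  dict of entity id -> vector (assumed representation)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a
                 typed relation (assumed representation)
    """
    target = embeddings[target_id]
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed edge, so not a candidate
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits


# Toy data: "d" is identical to "a" but already has a typed edge; "c" is orthogonal.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
typed_edges = {frozenset(("a", "d"))}
candidates = related_by_similarity("a", embeddings, typed_edges)
```

With this toy data only "b" is returned: "c" falls below the threshold and "d" is excluded because it already has a typed edge. Entities that pass the filter are the candidates for new edges or unrecognized duplicates.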