finding

active

finding:in-a-4-features-functionally-memorize-merchantability-or-fitness-for-a-particular-purpose-via-fsa-like-feature-chain

In A/4, features functionally memorize 'MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE' via an FSA-like feature chain

Demonstrates mechanistic memorization via feature assemblies in superposition
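
As a rough illustration of what an "FSA-like feature chain" could mean, the sketch below models each feature as a transition that fires on one token at one position in the phrase and upweights the next token. This is a toy lookup table, not the learned features from the source: the word-level tokenization and the names `build_feature_chain` and `complete` are hypothetical and chosen only for this example.

```python
# Toy model of an FSA-like chain: each "feature" is a transition keyed on
# (position in phrase, current token) that predicts the next token.
# Assumption: the phrase is split into whole words; a real model would use
# its own tokenizer and learned features, not a dictionary.
PHRASE = ["MERCHANTABILITY", "or", "FITNESS", "FOR", "A", "PARTICULAR", "PURPOSE"]


def build_feature_chain(tokens):
    """Return a dict mapping (state, token) -> next token for a fixed phrase."""
    return {(i, tok): tokens[i + 1] for i, tok in enumerate(tokens[:-1])}


def complete(chain, start_token, max_steps=20):
    """Follow the chain from start_token until no transition matches."""
    out = [start_token]
    state = 0
    while (state, out[-1]) in chain and len(out) <= max_steps:
        out.append(chain[(state, out[-1])])
        state += 1
    return out


chain = build_feature_chain(PHRASE)
print(" ".join(complete(chain, "MERCHANTABILITY")))
```

Starting from the first token reproduces the whole phrase, while starting from any other token yields no continuation, which mirrors the narrow pattern matching described in this finding.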

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Concepts (1)

concept

Memorization in Superposition
supports
Specific phrases or sequences memorized via binary features in superposition, enabling narrow pattern matching despite few neurons

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The likelihood of a dedicated feature for a concept (element, city, animal, food) follows a sigmoid in log-frequency of the concept in training data, with threshold frequency inversely proportional to number of alive features.
finding · 0.766
Quantitative relationship between concept frequency and feature presence.

Four features will be fundamental: (1) all processes rethought as morphogenetic; (2) sequences link into a net; (3) continuous evolution becomes widespread; (4) an ethical obligation to heal the land emerges.
claim · 0.752
Vision of the emerging paradigm shift in society.

Four features (A/0/20, A/0/0, A/0/30, A/0/494) form an FSA-like system implementing HTML tag generation
finding · 0.746
Concrete example of features connecting into FSA-like system implementing complex behavior

Single base64 feature A/0/45 splits into three distinct features in A/1: letter-specific, digit-specific, and ASCII-encoding-specific
finding · 0.745
Concrete example of feature splitting revealing unexpected model structure

Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute.
quote · 0.745
Definition of the newly named empirical effect.

Geometry of features matters for representation quality.
claim · 0.743
General principle supported tangentially by covariance pooling work; relates to feature co-occurrence structure.

The multiscale competency architecture (MCA) speeds evolutionary search by providing generalization, reliability, tractable search space, cryptic variation, and functional intermediates.
claim · 0.742
Main functional claim about MCA.

Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning
claim · 0.742
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations