claim

active

claim:the-isotropic-superposition-model-is-incomplete-because-features-cluster-into-higher-density-groups-due-to-correlated-activations-and-similar-downstream-actions

The isotropic superposition model is incomplete because features cluster into higher-density groups due to correlated activations and similar downstream actions

Authors revise their own prior Toy Models framework based on evidence from feature splitting and geometry

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Isotropic Superposition Model
contradicts
Prior model of superposition in which features are discrete one-dimensional objects that repel each other roughly evenly; the paper argues this model is incomplete

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions. · hypothesis · 0.764
Explanation for why dictionary learning can recover many more features than dimensions.
Superposition is in some sense deliberate: the model converts pure neurons into polysemantic neurons to store more features in fewer neurons. · claim · 0.758
Interpretation of the cars-in-superposition circuit finding as an intentional representational strategy
Results collectively provide strong evidence that some version of the superposition hypothesis and linear representation hypothesis is true · claim · 0.728
Authors' overall conclusion from number of interpretable features, activation-level correspondence to intensity, sensible logit weights, and interference weights
Superposition Hypothesis · framework · 0.721
Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
2D projections of activations show clearly separable clusters for F0-F2 and A1 at layer 25, but increasingly entangled activations for F4-F5 and A2-A3. · finding · 0.719
Visual geometric evidence for the fundamental entanglement of true/false activations in harder tasks.
Sauers' statistical anomaly: when models are given Janus post explaining transformers, reconstruction accuracy tails extend both ways, with ~1/1000 reconstructions anomalously accurate · finding · 0.718
Statistically rigorous analysis of Claude introspection; suggests models may have latent introspective capabilities that can be enhanced or disrupted.
We hypothesise that ecological models fall short of demonstrating spontaneous evolution of a new level of individuality because they are single-level networks of symmetric interactions. · hypothesis · 0.714
Explains limitation of current ecological connectionist models.
Similar superposition phenomena may exist in self-attention layers and similar sparse autoencoder methods may extract useful structure from attention · hypothesis · 0.713
Extension of superposition hypothesis to attention layers as future research direction
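
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the system's actual implementation: the function names `cosine` and `similar_untyped`, the dictionary of embeddings, and the edge list format are all assumptions made for illustration; only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: entities similar to the target but not linked to it.

    embeddings  -- dict mapping entity id to its embedding vector (assumed layout)
    typed_edges -- iterable of (source_id, destination_id) pairs (assumed layout)
    """
    target = embeddings[target_id]
    # Entities that already share a typed edge with the target, in either direction.
    linked = {dst for src, dst in typed_edges if src == target_id}
    linked |= {src for src, dst in typed_edges if dst == target_id}

    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, round(score, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

With toy two-dimensional embeddings, an entity that already has a typed edge (such as the "contradicts" framework link) is excluded even when its similarity is high, which is why it appears under Frameworks rather than in the similarity list.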