claim

active

claim:results-collectively-provide-strong-evidence-that-some-version-of-the-superposition-hypothesis-and-linear-representation-hypothesis-is-true

Results collectively provide strong evidence that some version of the superposition hypothesis and linear representation hypothesis is true

Authors' overall conclusion, drawn from the number of interpretable features, the correspondence between activation level and intensity, sensible logit weights, and interference weights

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Superposition Hypothesis
supports
Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
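
A minimal sketch of the idea described in this framework entry, not code from the source paper: more feature directions than dimensions are packed into one space, and a single active feature is read back with a dot product. The sizes (300 features, 100 dimensions), the random-direction construction, and all function names are assumptions chosen for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and scale it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_feature_directions(n_features, dim, seed=0):
    """Hypothetical dictionary: one unit direction per feature."""
    rng = random.Random(seed)
    return [random_unit_vector(dim, rng) for _ in range(n_features)]


def max_interference(directions):
    """Largest |cosine| between two distinct feature directions."""
    worst = 0.0
    for i in range(len(directions)):
        for j in range(i + 1, len(directions)):
            worst = max(worst, abs(dot(directions[i], directions[j])))
    return worst


def encode(active, directions):
    """Sum the directions of the active features, weighted by activation."""
    dim = len(directions[0])
    x = [0.0] * dim
    for idx, activation in active.items():
        for k in range(dim):
            x[k] += activation * directions[idx][k]
    return x


def readout(x, directions):
    """Linear readout: project the vector onto every feature direction."""
    return [dot(x, d) for d in directions]


# Assumed sizes: 300 features share a 100-dimensional space.
directions = build_feature_directions(n_features=300, dim=100)
x = encode({7: 1.0}, directions)
scores = readout(x, directions)
best = max(range(len(scores)), key=scores.__getitem__)
```

With one active feature, the readout for that feature equals its activation, and every other readout is the interference (cosine) between the two directions. With several features active at once, the interference terms add up, which is why the framework relies on features being sparse.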

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear Representation Hypothesis · framework · 0.800
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions. · hypothesis · 0.776
Explanation for why dictionary learning can recover many more features than dimensions.
Truth may be linearly separable in the model's representation space, but the structure is richer than a single linear axis · claim · 0.769
Interpretive synthesis of DIM and cone intervention successes
Non-Linear Representation Hypothesis · concept · 0.765
Hypothesis that information may be encoded in arbitrary non-linear subspaces of a neural network
Does the multi-directional nature of truth imply an underlying nonlinear representation, or is it compatible with linear separability? · question · 0.762
Theoretical open question about the geometry of truth in LLMs raised in Discussion
The results generalise readily to non-equilibrium systems where scaling relationships remain similar (e.g., dynamic or localised scaling). · claim · 0.757
Claim about broader applicability of the scaling argument
The relationship between representations of truth of input statements and of model outputs in conjunction with model performance has not been investigated. · question · 0.756
Future work direction identified in conclusion for enabling reliable truth assessment methods.
Early causal abstraction methods (Geiger et al. 2021) implicitly rely on the privileged bases hypothesis, while recent methods (Geiger et al. 2024b) rely on the linear representation hypothesis · claim · 0.751
Historical framing of how representation assumptions have evolved in causal interpretability
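
The similarity list above keeps entities whose cosine is at least 0.65 and that have no typed edge to this claim. A rough sketch of that filter follows, assuming each entity has an embedding vector; the function names, the toy two-dimensional vectors, and the entity labels are hypothetical and do not describe the actual pipeline behind this page.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def related_by_similarity(query, candidates, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, best first.

    candidates: dict mapping entity name to embedding (assumed input shape).
    typed_edges: set of names that already have a typed relation and are skipped.
    """
    hits = []
    for name, vector in candidates.items():
        if name in typed_edges:
            continue
        score = cosine(query, vector)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy embeddings, for illustration only.
query = [1.0, 0.0]
candidates = {
    "entity_a": [1.0, 0.0],   # cosine 1.0
    "entity_b": [0.0, 1.0],   # cosine 0.0
    "entity_c": [1.0, 1.0],   # cosine about 0.707
}
neighbors = related_by_similarity(query, candidates, typed_edges=set())
untyped_only = related_by_similarity(query, candidates, typed_edges={"entity_a"})
```

Entities that pass the threshold but are skipped because of an existing typed edge would appear under their relation type instead, as the Superposition Hypothesis framework does on this page.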