hypothesis

active

hypothesis:features-may-not-be-strictly-one-dimensional-objects-higher-dimensional-feature-manifolds-may-exist-in-model-representations

Features may not be strictly one-dimensional objects; higher-dimensional feature manifolds may exist in model representations

Extension of the superposition hypothesis to account for continuous families of features
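
To make the distinction concrete, here is a minimal NumPy sketch contrasting a one-dimensional feature (scalar intensity along a single direction) with a continuous family of features arranged on a circle inside a 2D subspace. Everything in it is an illustrative assumption rather than anything taken from the source paper: the width `d_model`, the random `basis`, the synthetic activations, and the helper `explained_variance_ratio` are all hypothetical.

```python
import numpy as np


def explained_variance_ratio(acts: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by each principal direction."""
    centered = acts - acts.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal direction.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return var / var.sum()


rng = np.random.default_rng(0)
d_model = 64  # hypothetical representation width (assumption)

# Two random orthonormal directions spanning a 2D subspace (assumption).
basis, _ = np.linalg.qr(rng.normal(size=(d_model, 2)))

# Continuous family of features: points on a circle, e.g. a cyclic variable.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ basis.T

# One-dimensional feature: scalar intensity along a single direction.
intensity = np.linspace(-1.0, 1.0, 200)[:, None]
line = intensity @ basis[:, :1].T

circle_ratio = explained_variance_ratio(circle)
line_ratio = explained_variance_ratio(line)

print("circle, top 3 components:", np.round(circle_ratio[:3], 3))
print("line,   top 3 components:", np.round(line_ratio[:3], 3))
```

In this toy setup the line's variance sits in a single principal direction, while the circle splits its variance about evenly across two, so no single direction summarizes it. Real model activations are noisier, and PCA alone would not establish that a manifold is a feature; the sketch only shows what "higher-dimensional" means geometrically.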

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
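
The selection rule described above (cosine similarity of at least 0.65, with no typed edge to this entity) could be sketched as follows. This is a guess at the logic, not the system's actual implementation: the function `untyped_neighbors`, the dictionary of embeddings, the edge set, and the toy vectors are all hypothetical.

```python
import numpy as np


def untyped_neighbors(
    query_id: str,
    embeddings: dict[str, np.ndarray],
    typed_edges: set[tuple[str, str]],
    threshold: float = 0.65,
) -> list[tuple[str, float]]:
    """Return (entity_id, cosine) pairs at or above threshold that have no typed edge."""
    q = embeddings[query_id]
    q = q / np.linalg.norm(q)
    results = []
    for other_id, vec in embeddings.items():
        if other_id == query_id:
            continue
        # Skip entities already linked by a typed relation, in either direction.
        if (query_id, other_id) in typed_edges or (other_id, query_id) in typed_edges:
            continue
        sim = float(q @ (vec / np.linalg.norm(vec)))
        if sim >= threshold:
            results.append((other_id, sim))
    # Highest similarity first, matching the ordering of the list on this page.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy 2D embeddings (assumption; real embeddings would be much wider).
embeddings = {
    "h": np.array([1.0, 0.0]),
    "a": np.array([1.0, 0.5]),   # similar, no typed edge: kept
    "b": np.array([0.0, 1.0]),   # orthogonal: below threshold
    "c": np.array([1.0, 0.1]),   # similar, but already has a typed edge
}
typed_edges = {("h", "c")}

candidates = untyped_neighbors("h", embeddings, typed_edges)
print(candidates)
```

Only `"a"` survives in this toy example: `"b"` falls below the threshold and `"c"` is excluded because it already has a typed relation.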

Geometry of features matters for representation quality. · claim · 0.803
General principle supported tangentially by covariance pooling work; relates to feature co-occurrence structure.
Feature Manifolds · framework · 0.801
Hypothesized extension of superposition where features may be higher-dimensional manifolds rather than 1D directions
Feature universality across independently trained models suggests features have some existence beyond individual models · claim · 0.780
Authors take agnostic position on ontological status but universality evidence pushes toward features being real
Lack of rigorous cross-model comparison demonstrating that specific named features (not just correlated ones) form across architectures · question · 0.777
Explicitly identified research gap: anecdotal evidence exists but rigorous characterization is absent
Representational abstraction of truth may emerge more clearly with model scale · claim · 0.774
Interpretation of weaker PCA separation and lower ASR in smaller models
The two-dimensional subspace reported by Burger et al. (2024) seems to reflect a stage of transition in the model's processing, rather than a universal property of truth directions. · quote · 0.773
Load-bearing interpretive claim about the layer-specificity of Burger et al.'s finding.
We hypothesize that representation geometry drives model behavior: the geometric structure of internal representations causally shapes what models do externally. · hypothesis · 0.769
The causal hypothesis motivating the use of causality (intervention) as the lens connecting representation and behavior geometry.
Features can be used to steer large models. · claim · 0.768
Clamping feature activations causally alters model behavior in interpretable ways.