claim

active

claim:cross-layer-superposition-is-a-fundamental-challenge-for-dictionary-learning

Cross-layer superposition is a fundamental challenge for dictionary learning.

Features smeared across layers cannot be fully disentangled by SAE on a single residual stream.

Source paper

extracted_from

Scaling monosemanticity: Ex-tracting interpretable features from claude 3 sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Cross-layer superpositionconcept0.824
Representation of features spread across multiple layers, complicating dictionary learning.
Representing non-linearly separable functions requires a network with multiple layers.claim0.772
Architectural requirement from machine learning.
Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions.hypothesis0.767
Explanation for why dictionary learning can recover many more features than dimensions.
Superposition is in some sense deliberate: the model converts pure neurons into polysemantic neurons to store more features in fewer neurons.claim0.765
Interpretation of the cars-in-superposition circuit finding as an intentional representational strategy
No single layer is universally optimal for probing truth directions; different tasks peak at different layers.claim0.758
Argues against the single-layer analysis approach of prior work.
Single-layer analyses can be misleading because early-layer truth directions may reflect surface features with limited cross-task generalization.claim0.758
Methodological critique of prior work that fixed a single layer for truth probing.
Dictionary learning offers advantages over linear probes: amortization of cost and unsupervised discovery of abstractions.claim0.758
SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
Single dendritic layer solves XOR-like problems with capacity matching 8-layer deep networks.finding0.755
Evidence from Beniaguev et al. (2021) that individual biological neurons vastly outperform McCulloch-Pitts model; supports hybrid computation claim.