Untied Decoder Weights

Autoencoder design choice to learn separate encoder and decoder weights rather than constraining the decoder to be the transpose of the encoder, increasing representational capacity by allowing encoder vectors to distinguish similar features.
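
To make the distinction concrete, here is a minimal NumPy sketch contrasting a tied autoencoder (decoder is the transpose of the encoder) with an untied one (two independent matrices). The sizes `d_model` and `d_hidden`, the ReLU nonlinearity, the omission of biases, and the helper name `reconstruct` are all illustrative assumptions, not details taken from this entry.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 8  # hypothetical sizes, chosen only for illustration

# Tied: the decoder is the transpose of the encoder, so there is one matrix.
W_enc_tied = rng.normal(size=(d_model, d_hidden))
W_dec_tied = W_enc_tied.T  # a view onto the same parameters

# Untied: encoder and decoder are separate, independently learned matrices.
W_enc = rng.normal(size=(d_model, d_hidden))
W_dec = rng.normal(size=(d_hidden, d_model))

def reconstruct(x, W_e, W_d):
    """Encode with a ReLU (assumed nonlinearity), then decode linearly."""
    f = np.maximum(x @ W_e, 0.0)
    return f @ W_d

x = rng.normal(size=(2, d_model))
x_hat_tied = reconstruct(x, W_enc_tied, W_dec_tied)
x_hat_untied = reconstruct(x, W_enc, W_dec)

# Untying doubles the number of weight parameters (biases omitted here).
n_params_tied = W_enc_tied.size
n_params_untied = W_enc.size + W_dec.size
```

In the untied case the encoder row that detects a feature no longer has to equal the decoder direction that writes it, which is the extra freedom the description above refers to.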

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Sparse Autoencoder for Dictionary Learning
associated_with
Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
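
As a rough illustration of that description, the sketch below computes the forward pass and loss of a one-hidden-layer sparse autoencoder with untied weights. The function names `sae_forward` and `sae_loss`, the coefficient `l1_coeff`, the ReLU activation, the squared-error reconstruction term, and the dimensions are assumptions made for the example; the source only specifies a one-hidden-layer MLP, an L1 sparsity penalty, and an overcomplete dictionary.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One hidden layer: ReLU feature activations, then a linear decoder."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on activations."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, d_dict = 16, 64  # d_dict > d_model makes the dictionary overcomplete
x = rng.normal(size=(32, d_model))  # stand-in for model activations

# Untied weights: W_dec is not constrained to equal W_enc.T.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
```

Only the loss computation is shown; a training loop would minimize `loss` with respect to all four parameter arrays.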

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Interference Weights · concept · 0.715
Logit weight contributions from a feature that arise due to superposition with other features, not from the feature's own causal role
Disentangled Representations · framework · 0.713
Adjacent ML literature on separating independent factors of variation; related to but distinct from the polysemanticity problem studied here
Equal Weighting · framework · 0.710
Baseline MTL approach minimizing sum of task losses with equal weights; suffers from task balancing
Uncertainty Weighting (UW) · method · 0.704
Loss balancing using homoscedastic uncertainty.
Feature neighborhood exploration via cosine similarity of decoder weights · method · 0.697
Identifying related features by cosine distance in SAE decoder space.
Disentanglement · framework · 0.680
Related research agenda seeking representations that separate conceptually distinct factors; contrasted with superposition approach
"You can literally read meaningful algorithms off of the weights." · quote · 0.679
Load-bearing claim about the tractability of circuit analysis; central thesis of Claim 2
Linear Decoding · method · 0.676
Correlative technique measuring the type of information encoded in distributed representations via linear predictability.
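
The related method listed above, feature neighborhood exploration via cosine similarity of decoder weights, can be sketched in a few lines. This is an illustrative example only: the function name `decoder_neighbors`, the parameter `k`, and the toy matrix are hypothetical, and the code assumes each row of `W_dec` is one feature's decoder direction with nonzero norm.

```python
import numpy as np

def decoder_neighbors(W_dec, feature_idx, k=3):
    """Return the k features whose decoder rows are most cosine-similar."""
    # Normalize each decoder row to unit length (assumes no all-zero rows).
    unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    sims = unit @ unit[feature_idx]
    sims[feature_idx] = -np.inf  # exclude the query feature itself
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

# Toy decoder with four features in a three-dimensional activation space.
W_dec = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],   # nearly parallel to feature 0
    [0.0, 1.0, 0.0],   # orthogonal to feature 0
    [-1.0, 0.0, 0.0],  # opposite to feature 0
])
neighbors = decoder_neighbors(W_dec, feature_idx=0, k=2)
```

With untied weights, the same search over encoder rows can return different neighbors, since encoder and decoder directions are learned independently.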