finding

active

finding:learned-rotation-matrices-are-non-trivial-majority-of-basis-vectors-are-rotated-indicating-highly-distributed-representations

Learned rotation matrices are non-trivial: the majority of basis vectors are rotated, indicating highly distributed representations

Learned rotations reveal that direct probes over standard activation bases would miss the actual causal role of representations.
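One way to make "the majority of basis vectors are rotated" concrete is to count how many columns of an orthogonal matrix fail to line up with any single standard axis. The sketch below is an illustrative assumption, not the metric used in the source paper: the function names `rotated_fraction` and `givens` and the threshold `tol` are hypothetical, and the matrices are hand-built rather than learned.

```python
import numpy as np

def rotated_fraction(R, tol=0.99):
    """Fraction of columns of an orthogonal matrix R that are not
    (up to sign) a single standard basis vector.

    `tol` is an assumed, illustrative threshold, not a value from the paper.
    """
    R = np.asarray(R, dtype=float)
    if not np.allclose(R.T @ R, np.eye(R.shape[0]), atol=1e-8):
        raise ValueError("R must be orthogonal")
    # Columns have unit norm, so the largest absolute entry equals 1
    # exactly when the column sits on one standard axis.
    peak = np.abs(R).max(axis=0)
    return float(np.mean(peak < tol))

def givens(n, i, j, theta):
    """n x n rotation by angle theta in the plane of axes i and j."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i], G[i, j] = c, -s
    G[j, i], G[j, j] = s, c
    return G

n = 8
trivial = np.eye(n)[:, ::-1]           # a permutation: axes relabeled only
partial = givens(n, 0, 1, np.pi / 4)   # mixes two of the eight axes
dense = np.eye(n)
for k in range(0, n, 2):               # mixes every axis with a neighbor
    dense = dense @ givens(n, k, k + 1, np.pi / 4)

print(rotated_fraction(trivial))  # 0.0
print(rotated_fraction(partial))  # 0.25
print(rotated_fraction(dense))    # 1.0
```

Under this assumed measure, a permutation counts as trivial because it only relabels neurons, while a matrix scoring above 0.5 would match the "majority rotated" description in the finding.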

Source paper

extracted_from

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

(2023) · Atticus Geiger · Zhengxuan Wu · Christopher Potts · Thomas Icard +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Direct probes over learned activations in the standard basis may fail to reveal the actual causal role of representations because they are highly distributed
supports
Supported by the finding that non-trivial rotations are required to find aligned representations.
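A toy example may help show why a rotation matters here. It assumes a variable stored along a direction that mixes two neurons, then swaps that variable between two inputs either in the rotated basis or neuron by neuron in the standard basis. Everything in it is hypothetical (the 4-dimensional state, the 45-degree rotation, and the helpers `encode`, `decode`, and `interchange`), and the rotation is fixed by hand rather than learned as in the source paper.

```python
import numpy as np

# Assumed toy setup: the variable lives along a direction that mixes
# neurons 0 and 1 equally (a 45-degree rotation of the standard basis).
c = s = np.sqrt(0.5)
R = np.eye(4)
R[:2, :2] = [[c, -s], [s, c]]

def encode(value, nuisance):
    """Hidden state with `value` on rotated axis 0 and `nuisance` on axis 1."""
    return R @ np.array([value, nuisance, 0.0, 0.0])

def decode(h):
    """Read the variable back by projecting onto rotated axis 0."""
    return float(R[:, 0] @ h)

def interchange(base, source, basis, dims):
    """Copy coordinates `dims` from `source` into `base`, in `basis`."""
    zb, zs = basis.T @ base, basis.T @ source
    zb[dims] = zs[dims]
    return basis @ zb

base = encode(1.0, 0.3)
source = encode(-1.0, 0.7)

rotated = interchange(base, source, R, [0])           # swap in rotated basis
standard = interchange(base, source, np.eye(4), [0])  # swap neuron 0 only

print(round(decode(rotated), 6))   # -1.0: the variable is transferred cleanly
print(round(decode(standard), 6))  # -0.2: a blend of both inputs
```

In this sketch the single-neuron swap leaves the variable at a value that matches neither input, which is the sense in which an analysis restricted to the standard basis could miss a representation's causal role.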

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Learned simulations can be partially observed and lazily-rendered, and still work. · claim · 0.754
One of the updates about prosaic ML simulation.
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.751
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
There are fewer representations competent for N tasks than M<N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.743
Selective pressure toward convergence via task generality
Key, query, and value vectors are intermediary byproducts; W_OV and W_QK are the fundamental low-rank matrices describing attention head behavior · claim · 0.738
Reframing observation: the canonical K/Q/V decomposition is computationally convenient but not the most interpretable representation
Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design. · claim · 0.738
Core applied contribution claim, supported by top-k accuracy comparisons.
Certain representation learning algorithms boil down to a simple rule: find an embedding in which similarity equals PMI · claim · 0.736
Core theoretical claim about the target of representation learning
Positional encodings inferred on the fly from previously learned structures would offer a fruitful research direction for language, maths, and logic · claim · 0.734
Forward-looking interpretive claim about the implications of recurrent position encodings for NLP research.
There is a bidirectional relationship between the geometry of representation and behavior across tasks and modalities. · claim · 0.732
Author’s interpretive claim that the shared geometry is general and robust.