question

active

question:what-matrix-decomposition-or-dimensionality-reduction-best-summarizes-the-enormous-low-rank-ov-and-qk-matrices

What matrix decomposition or dimensionality reduction best summarizes the enormous low-rank OV and QK matrices?

Open methodological question: how to convert the 50k × 50k expanded matrices into human-graspable summaries.
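One candidate summary, offered only as a sketch and not as the source paper's answer, is a truncated singular value decomposition. Because an expanded matrix of this kind is a product of a tall factor and a wide factor whose inner dimension is small, its SVD can be computed without ever materializing the full vocabulary-by-vocabulary matrix. The function name `low_rank_svd`, the factor names `A` and `B`, and the toy sizes below are assumptions made for illustration; random matrices stand in for real model weights.

```python
import numpy as np


def low_rank_svd(A, B):
    """Return U, S, Vt with A @ B == (U * S) @ Vt, without forming A @ B.

    A has shape (n, r) and B has shape (r, m), with r much smaller than n and m.
    """
    # Reduced QR of each factor: A = Qa @ Ra and B.T = Qb @ Rb.
    Qa, Ra = np.linalg.qr(A)      # Qa: (n, r), Ra: (r, r)
    Qb, Rb = np.linalg.qr(B.T)    # Qb: (m, r), Rb: (r, r)

    # A @ B = Qa @ (Ra @ Rb.T) @ Qb.T, so only an r x r core needs an SVD.
    U_small, S, Vt_small = np.linalg.svd(Ra @ Rb.T)

    # Rotate the small singular vectors back into the large spaces.
    U = Qa @ U_small              # (n, r), orthonormal columns
    Vt = Vt_small @ Qb.T          # (r, m), orthonormal rows
    return U, S, Vt


# Toy stand-ins (assumed sizes): 500 plays the role of the ~50k vocabulary,
# 8 plays the role of the head dimension.
rng = np.random.default_rng(0)
n_vocab, d_head = 500, 8
A = rng.standard_normal((n_vocab, d_head))   # hypothetical "output side" factor
B = rng.standard_normal((d_head, n_vocab))   # hypothetical "input side" factor

U, S, Vt = low_rank_svd(A, B)
print("singular values:", np.round(S, 2))
print("largest-magnitude row indices of the top left singular vector:",
      np.argsort(-np.abs(U[:, 0]))[:5])
```

In this sketch the cost is dominated by two QR factorizations of size n × r, so it scales linearly in the vocabulary size. Whether singular vectors are the most interpretable summary is exactly what the question leaves open; the same factored trick applies to other decompositions that only need the small r × r core.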

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Papers (1)

paper

A Mathematical Framework for Transformer Circuits
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Key, query, and value vectors are intermediary byproducts; W_OV and W_QK are the fundamental low-rank matrices describing attention head behavior · claim · 0.771
Reframing observation: the canonical K/Q/V decomposition is computationally convenient but not the most interpretable representation
Rank-one matrix decomposition · method · 0.769
Constraint in VPD under which each parameter subcomponent must be a rank-one matrix, for simplicity.
As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs · claim · 0.743
Interpretive claim connecting scale to abstraction level in LLM representations
The two-dimensional subspace reported by Burger et al. (2024) seems to reflect a stage of transition in the model's processing, rather than a universal property of truth directions. · quote · 0.734
Load-bearing interpretive claim about the layer-specificity of Burger et al.'s finding.
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.728
Demonstrates that averaging multiple prompt pairs reduces noise, and that selecting an optimal subset improves performance further.
QwQ and Qwen models have been extensively post-trained to excel at single-step tasks, causing degradation in long multi-turn interactions. · hypothesis · 0.726
Proposed explanation for why single-turn reformulation improves performance: the models' training distribution is concentrated on single-turn reasoning.
There are fewer representations competent for N tasks than for M<N tasks, so training more general models should yield fewer possible solutions · hypothesis · 0.725
Selective pressure toward convergence via task generality.
We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks. · hypothesis · 0.724
Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.