paper
active
2026
paper:dooms-goodfire-covariance-pooling-2026

Covariance-based Sequence Pooling

By Thomas Dooms · Nicholas K. Wang · Michael T. Pearce

TL;DR

Covariance pooling — replacing mean pooling with second-moment statistics over token embeddings — yields a +52.9% gain in R² on genomic track prediction and a +8.4% AUC improvement on Gene Ontology prediction when applied to genomic foundation models. The method, introduced by Dooms, Wang, and Pearce at Goodfire, computes pairwise feature co-occurrence structure across a sequence's token embeddings rather than collapsing them to a single mean vector, thereby preserving joint activation patterns that mean pooling discards by construction. These gains hold with unsupervised autoencoder embeddings, requiring no large labeled datasets — the compact covariance representations are derived from gigabytes of raw activations compressed into stable, fixed-size matrices. The method emerged as a methodological side-product of the EVEE/Mayo collaboration, suggesting its origin was empirical rather than theoretical. The deeper claim is a structural one: first moments (means) are insufficient summaries of embedding geometry whenever the discriminative signal lives in feature co-occurrence rather than marginal activation levels. The paper argues this implies mean-pooling baselines are systematically underperforming across any domain where token interaction structure is predictively relevant, making covariance pooling a candidate default for sequence-level representation in genomic and potentially other foundation-model pipelines.
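
As a minimal sketch of the two aggregation steps contrasted above, the NumPy snippet below pools a (tokens × features) embedding matrix both ways. It is an illustration under assumptions, not the authors' implementation: the function names, the centering step, the division by sequence length, and the random stand-in embeddings are all hypothetical choices, since the summary does not specify them.

```python
import numpy as np

def mean_pool(X):
    """First-moment summary: (T, d) token embeddings -> (d,) vector."""
    return X.mean(axis=0)

def covariance_pool(X):
    """Second-moment summary: (T, d) token embeddings -> (d, d) matrix.

    Assumption: centered covariance normalized by T. The source does not
    say whether the authors center or how they normalize.
    """
    centered = X - X.mean(axis=0, keepdims=True)
    return centered.T @ centered / X.shape[0]

# Hypothetical stand-in for frozen model activations of two sequences
# with different lengths but the same embedding width d = 8.
rng = np.random.default_rng(0)
short_seq = rng.normal(size=(50, 8))
long_seq = rng.normal(size=(500, 8))

for X in (short_seq, long_seq):
    # Both summaries have a fixed size that does not depend on T.
    print(X.shape, mean_pool(X).shape, covariance_pool(X).shape)
```

Both outputs are independent of sequence length, but the covariance summary grows quadratically in the embedding width, which is presumably why the source mentions flattening or compressing it.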

What to take away

  1. Covariance pooling over token embeddings from genomic foundation models achieves a +52.9% R² improvement over mean pooling on a genomic track prediction benchmark.
  2. On Gene Ontology prediction, covariance pooling with unsupervised autoencoder embeddings raises AUC by +8.4% relative to the mean-pooling baseline.
  3. The method requires no large labeled datasets, deriving compact stable embeddings by compressing gigabytes of raw model activations into fixed-size second-moment matrices.
  4. The central methodological claim is that mean pooling discards joint activation structure (feature co-occurrence) by collapsing to first moments, while covariance pooling retains pairwise feature statistics across all sequence positions.
  5. The covariance pooling method was developed as a side-product of the EVEE/Mayo genomics collaboration at Goodfire, indicating the benchmark tasks were drawn from real applied genomics workflows rather than synthetic benchmarks.
  6. To replicate the core comparison, a researcher would compute the feature-by-feature covariance matrix over token positions from a frozen genomic foundation model's residual stream, flatten or compress it, and train a linear probe alongside a matched mean-pooling probe on the same split.
  7. The authors raise the open question of whether the second-moment advantage generalizes beyond genomics to any sequence domain where token interaction structure, rather than marginal activation levels, carries the predictive signal.
  8. Goodfire's framing positions covariance pooling as a candidate default aggregation method for genomic foundation models, implicitly predicting that mean-pooling baselines in published genomics benchmarks systematically understate what the underlying models can achieve.
  9. The unsupervised autoencoder embedding condition (yielding +8.4% AUC on Gene Ontology) is notable because it demonstrates that the gain does not depend on task-supervised representation learning.
  10. Covariance pooling operates on the same frozen model activations as mean pooling, so the computational overhead is confined to the aggregation step and does not require retraining or fine-tuning the underlying genomic foundation model.
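
The replication recipe in the list above (pool frozen activations two ways, then fit matched linear probes on the same split) can be sketched as follows. Everything concrete here is an assumption made for illustration: the data are synthetic rather than activations from a genomic foundation model, the sizes `N`, `T`, `d` are hypothetical, the upper-triangle flattening is one of several possible choices, and the ridge probe stands in for whatever probe the authors used. The synthetic label is built to depend only on feature co-occurrence, so this shows the mechanism and says nothing about the reported numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d = 400, 200, 6  # sequences, tokens per sequence, embedding width (hypothetical)

# Synthetic stand-in for frozen activations: the label is the correlation
# between features 0 and 1, so it lives purely in co-occurrence structure.
rho = rng.uniform(-0.9, 0.9, size=(N, 1))
Z = rng.normal(size=(N, T, d))
X = Z.copy()
X[:, :, 1] = rho * Z[:, :, 0] + np.sqrt(1.0 - rho**2) * Z[:, :, 1]
y = rho[:, 0] + 0.05 * rng.normal(size=N)

# Pool every sequence two ways.
mean_feats = X.mean(axis=1)                               # (N, d)
centered = X - mean_feats[:, None, :]
cov = np.einsum("ntd,nte->nde", centered, centered) / T   # (N, d, d)
rows, cols = np.triu_indices(d)
cov_feats = cov[:, rows, cols]                            # flattened upper triangle

def add_bias(A):
    """Append a constant column so the probe has an intercept."""
    return np.hstack([A, np.ones((A.shape[0], 1))])

def fit_ridge(A, targets, lam=1e-3):
    """Closed-form ridge regression; returns weights including the bias."""
    A1 = add_bias(A)
    return np.linalg.solve(A1.T @ A1 + lam * np.eye(A1.shape[1]), A1.T @ targets)

def r2_score(A, targets, w):
    """Coefficient of determination of the linear probe on held-out data."""
    resid = targets - add_bias(A) @ w
    return 1.0 - (resid @ resid) / np.sum((targets - targets.mean()) ** 2)

# Same train/test split for both probes.
train, test = np.arange(0, 300), np.arange(300, N)
r2_mean = r2_score(mean_feats[test], y[test], fit_ridge(mean_feats[train], y[train]))
r2_cov = r2_score(cov_feats[test], y[test], fit_ridge(cov_feats[train], y[train]))
print(f"mean-pool R^2: {r2_mean:.3f}   covariance-pool R^2: {r2_cov:.3f}")
```

On real activations the embedding width is far larger than 6, so the flattened covariance would likely need the compression step the source mentions before a probe is practical.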

Peer brief — for seminar discussion

Dooms, Wang, and Pearce at Goodfire introduce covariance pooling, a sequence aggregation method that replaces the conventional mean pooling of token embeddings with a second-moment (covariance) summary computed across all token positions in a sequence. Applied to genomic foundation models, the method computes pairwise feature co-occurrence statistics from the full token embedding matrix, producing a compact fixed-size representation without requiring labeled data or model fine-tuning. The load-bearing finding is a +52.9% R² improvement on a genomic track prediction task and a +8.4% AUC gain on Gene Ontology prediction, both relative to mean-pooling baselines, with the Gene Ontology result obtained using unsupervised autoencoder embeddings. These are not marginal gains: a 52.9% R² lift suggests mean pooling is severely misspecified for this representational regime, not merely suboptimal. The paper argues the mechanism is structural. Mean pooling is a first-moment statistic, so it is a sufficient summary only when the downstream signal is linear in marginal activations; when discriminative information lives in feature co-activation patterns (plausible in genomic sequences, where combinations of regulatory motifs matter), the mean discards that signal by construction. The work does not benchmark against attention-weighted pooling or CLS-token projection, a notable omission because those methods also attempt to preserve relational structure across positions, though at higher parametric cost. The method originated as a methodological byproduct of the EVEE/Mayo collaboration, and the two benchmark tasks (genomic track prediction and Gene Ontology classification) reflect that applied context. The broader hypothesis, stated explicitly, is that the second-moment advantage should generalize beyond genomics to any domain where token interaction geometry is predictively relevant.
A critical reader would push back on external validity. Both benchmark tasks come from a single application domain (genomics) and likely share architectural priors. The genomic foundation models used are not named in the available summary, so it is impossible to assess whether the gains are model-specific (e.g., tied to a particular tokenization scheme or embedding dimensionality that inflates covariance signal). Until covariance pooling is evaluated on at least one non-genomic sequence model (e.g., a protein language model like ESM-2 650M or a text transformer), the generalization claim rests on theoretical argument rather than evidence. The compression from gigabytes of activations to stable matrices is also underspecified: the exact rank reduction or autoencoder architecture used in the unsupervised condition would materially affect reproducibility.
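
The structural argument in the brief, that a first-moment summary cannot see feature co-activation, can be checked on a toy case. The two hand-built "sequences" below are hypothetical and have nothing to do with the paper's data: in one, two features fire together; in the other, they alternate. Their mean-pooled vectors are identical, while their covariance matrices differ in the sign of the off-diagonal entry.

```python
import numpy as np

# Toy sequences of 8 tokens with 2 features each (hypothetical values).
co_active = np.array([[1.0, 1.0], [0.0, 0.0]] * 4)    # features fire together
alternating = np.array([[1.0, 0.0], [0.0, 1.0]] * 4)  # features never co-occur

mean_a = co_active.mean(axis=0)
mean_b = alternating.mean(axis=0)

# bias=True normalizes by the number of tokens rather than tokens - 1.
cov_a = np.cov(co_active, rowvar=False, bias=True)
cov_b = np.cov(alternating, rowvar=False, bias=True)

print("means equal:", np.allclose(mean_a, mean_b))
print("off-diagonal covariance:", cov_a[0, 1], cov_b[0, 1])
```

Any probe fed only the mean vectors receives the same input for both sequences, so it cannot separate them regardless of how it is trained; the off-diagonal covariance entry separates them directly.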
