2026
paper:pearce-goodfire-evee-genetic-variants-2026

Explaining 4.2 million genetic variants with state-of-the-art, interpretable predictions

TL;DR

Evee, a variant-pathogenicity platform built on the Evo 2 genomic foundation model, achieves 0.997 AUROC on an 839,000-variant ClinVar benchmark, outperforming all previously reported methods, while also generating mechanistic "disruption profiles" that explain *why* a variant is predicted pathogenic rather than producing a scalar score alone. Zero-shot performance on insertions and deletions reaches 0.991 AUROC, a regime where many supervised methods degrade sharply. The system scales to approximately 4.2 million variants in total, including roughly 2 million variants of uncertain significance (VUS) for which no ground-truth labels exist and for which the disruption profiles constitute the primary clinical output. In a structured human evaluation, disruption profiles scored 3.8/5 for explanation quality against 2.8/5 for metadata-only baselines, a 36% relative gain in perceived explanatory value. The method extracts these profiles from Evo 2's internal representations via a sparse-feature-attribution pipeline developed jointly by Goodfire and Mayo Clinic. The paper argues that this demonstrates that mechanistic interpretability applied to large biological sequence models can close the gap between black-box accuracy and clinician-usable reasoning, making genome-wide functional annotation tractable for direct clinical deployment on VUS that currently stall diagnostic workflows.
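For readers less familiar with the headline metric: AUROC is the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The following is a minimal sketch of that definition, not Evee's evaluation code. The function name `auroc` and the toy labels and scores are invented for illustration.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison (Mann-Whitney U); ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration: 1 = pathogenic, 0 = benign.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
print(f"toy AUROC = {toy_auroc:.3f}")
```

The pairwise loop is quadratic in the number of variants, so a benchmark of ClinVar's size would in practice call for a rank-based implementation; the quantity computed is the same.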

What to take away

  1. Evee achieves 0.997 AUROC on pathogenicity classification across 839,000 ClinVar variants, surpassing all previously benchmarked computational methods on that dataset.
  2. Zero-shot performance on insertions and deletions reaches 0.991 AUROC, demonstrating that Evo 2's representations generalize to indels without task-specific fine-tuning.
  3. The platform generates "disruption profiles" (mechanistic, feature-level explanations derived from Evo 2's internal activations) rather than outputting only a pathogenicity probability.
  4. Disruption profiles received a mean human-evaluation score of 3.8/5 for explanation quality, compared with 2.8/5 for metadata-only outputs, a 36% relative improvement in perceived explanatory adequacy.
  5. Evee covers approximately 4.2 million genetic variants in total, of which roughly 2 million are variants of uncertain significance (VUS), the class with the largest unmet clinical need.
  6. The underlying model is Evo 2, a genomic foundation model trained on large-scale DNA sequence corpora and used here as a frozen encoder whose internals are probed via Goodfire's sparse-feature-attribution pipeline.
  7. The work was produced in collaboration between Goodfire and Mayo Clinic (partnership announced September 2025), positioning the tool for direct integration into clinical genomic workflows.
  8. An open question the work raises is whether disruption profiles derived from sequence-level model features capture biologically distinct mechanisms across variant classes (e.g., splice-site vs. missense), or whether the features conflate multiple causal pathways.
  9. To replicate the explanation-quality evaluation, a researcher could present matched variant records (one with a disruption profile, one with metadata only) to clinical genomicists using a 1–5 Likert rubric and analyze the results with a paired Wilcoxon signed-rank test. This design is consistent with the comparison the study reports, although the study does not spell it out.
  10. The public tool at evee.goodfire.ai exposes variant-level disruption profiles, enabling external validation against independent VUS cohorts or disease-specific registries not included in the ClinVar benchmark.
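The replication design in the takeaways (paired Likert ratings analyzed with a Wilcoxon signed-rank test) and the 36% relative-gain arithmetic can be sketched as follows. This is an illustrative pure-Python version using the normal approximation with a tie correction, which is rough for small samples. The ratings are invented, and the function names are hypothetical rather than taken from the study.

```python
import math


def average_ranks(values):
    """Rank values ascending (1-based), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(with_profile, metadata_only):
    """Two-sided paired Wilcoxon test via the normal approximation."""
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(with_profile, metadata_only) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    abs_diffs = [abs(d) for d in diffs]
    ranks = average_ranks(abs_diffs)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    # Tie correction: subtract sum(t^3 - t) / 48 over groups of tied |d|.
    tie_counts = {}
    for v in abs_diffs:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values()) / 48.0
    variance = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, w_minus, z, p_value


# Invented 1-5 Likert ratings for eight matched variant records.
ratings_with_profile = [4, 5, 3, 4, 4, 3, 5, 4]
ratings_metadata_only = [3, 3, 3, 2, 4, 4, 3, 3]
w_plus, w_minus, z, p_value = wilcoxon_signed_rank(
    ratings_with_profile, ratings_metadata_only
)

# Relative gain from the reported means: (3.8 - 2.8) / 2.8.
relative_gain = (3.8 - 2.8) / 2.8
print(f"W+ = {w_plus}, W- = {w_minus}, z = {z:.3f}, p = {p_value:.3f}")
print(f"relative gain = {relative_gain:.1%}")
```

With only a handful of raters or records, an exact test would be preferable to the normal approximation used here.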

Peer brief — for seminar discussion

This paper introduces Evee, a variant-pathogenicity prediction and explanation system built on Evo 2, a large genomic foundation model, developed jointly by Goodfire and Mayo Clinic. Rather than treating pathogenicity scoring as a pure classification problem, Evee extracts sparse feature-attribution signals from Evo 2's internal representations to produce what the authors call disruption profiles: per-variant mechanistic explanations indicating which biological processes are predicted to be disrupted. The method is this sparse-feature-attribution pipeline applied to a frozen Evo 2 encoder. An alternative the work could have used is gradient-based saliency attribution (e.g., integrated gradients over input nucleotide positions), which would have provided a natural baseline for assessing whether internal-feature explanations carry information beyond input-level sensitivity maps. The load-bearing finding is a 0.997 AUROC on 839,000 ClinVar variants (the largest published benchmark of its type) and 0.991 AUROC zero-shot on insertions and deletions, both reported as state-of-the-art. Coverage extends to roughly 4.2 million variants total, including approximately 2 million variants of uncertain significance for which disruption profiles are the primary output. A structured human evaluation found that disruption profiles scored 3.8/5 versus 2.8/5 for metadata-only conditions, a 36% relative gain, with raters apparently drawn from a clinical genomics context, consistent with the Mayo Clinic collaboration. The paper's implicit prediction is that mechanistic interpretability applied at genome scale can convert model internals into clinician-actionable reasoning, that is, that accuracy and explainability are jointly achievable rather than in tension. This would matter considerably for the diagnostic pipeline around VUS, which currently stall workups precisely because high-accuracy classifiers offer no mechanistic rationale.
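To make the gradient-based alternative mentioned in this brief concrete, here is a minimal sketch of integrated gradients over one-hot nucleotide inputs. It is not Evee's method and does not use Evo 2: a tiny logistic scorer with invented weights stands in for the model, and every name in the block is hypothetical.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Flatten a DNA string into a one-hot vector of length 4 * len(seq)."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def toy_model(x, weights, bias):
    """Stand-in scorer: logistic regression over the one-hot input."""
    logit = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-logit))


def toy_gradient(x, weights, bias):
    """Analytic gradient of the logistic score with respect to the input."""
    p = toy_model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Midpoint Riemann approximation of the integrated-gradients path integral."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = toy_gradient(point, weights, bias)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Invented weights for a 3-base window, four entries (A, C, G, T) per position.
weights = [0.5, -0.2, 0.1, 0.0, -0.3, 0.8, 0.0, 0.1, 0.2, 0.0, -0.6, 0.4]
bias = -0.1
x = one_hot("ACG")
baseline = [0.0] * len(x)  # all-zero reference input

attributions = integrated_gradients(x, baseline, weights, bias)
per_position = [sum(attributions[4 * i:4 * i + 4]) for i in range(3)]
score_delta = toy_model(x, weights, bias) - toy_model(baseline, weights, bias)
print("per-position attribution:", [round(a, 4) for a in per_position])
print("sum of attributions vs score change:", round(sum(attributions), 4), round(score_delta, 4))
```

The attributions sum approximately to the change in score between the baseline and the input, which is the completeness property that makes integrated gradients a convenient input-level baseline to compare against internal-feature explanations.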
The contestable element a critical reader should press on is the explanation-quality evaluation design. A 3.8 vs. 2.8 score differential on a 1–5 Likert scale is suggestive, but the rater pool size, rater expertise level, and inter-rater reliability statistics are not prominently reported in the preprint. It is therefore unclear whether the improvement reflects genuine mechanistic informativeness or simply that any structured profile narrative scores higher than sparse metadata regardless of biological content. The latter would be a form of the fluency bias that is well documented in LLM evaluation and plausibly operative here, where model-derived prose profiles are rated against terse metadata fields. Without an ablation showing that profiles generated from shuffled or randomly attributed features score lower than real profiles, the 3.8/5 figure cannot be cleanly attributed to the attribution method itself.
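One of the missing statistics flagged above, inter-rater reliability, is straightforward to compute once per-rater scores are available. The sketch below implements unweighted Cohen's kappa for two raters; the ratings are invented and the function name is hypothetical. For ordinal Likert data a weighted kappa would arguably be the better choice, so treat this as the simplest possible instance.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: agreement between two raters beyond chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal rate, per category.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return (observed - expected) / (1.0 - expected)


# Invented 1-5 Likert scores from two raters on the same eight profiles.
rater_a = [4, 4, 3, 5, 2, 4, 3, 4]
rater_b = [4, 3, 3, 5, 2, 4, 4, 4]
kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.3f}")
```

The same paired-rating layout would also support the shuffled-feature ablation: collect ratings for real and shuffled profiles of the same variants and compare them with a paired test.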

Methods (2)

  • Disruption profiles
    Mechanistic explanation outputs from Evee showing how variants affect gene function; scored 3.8/5 for explanation quality in human evaluation.
  • Evee variant effect prediction method
    The method that predicts and explains variant pathogenicity using Evo 2, producing disruption profiles.

Frameworks (1)

  • Evo 2
    Genomic foundation model used to predict and explain variant pathogenicity.

Datasets (2)

  • 4.2M whole-genome variants dataset
    The set of 4.2 million genetic variants across the human genome for which Evee provides predictions and explanations.
  • ClinVar
    Curated database of 839k variants used to train and evaluate Evee pathogenicity predictions.

