paper:pearce-goodfire-evee-genetic-variants-2026
Explaining 4.2 million genetic variants with state-of-the-art, interpretable predictions
TL;DR
Evee, a variant-pathogenicity platform built on the Evo 2 genomic foundation model, achieves 0.997 AUROC on an 839,000-variant ClinVar benchmark, outperforming all previously reported methods, while also generating mechanistic "disruption profiles" that explain *why* a variant is predicted pathogenic rather than producing a scalar score alone. Zero-shot performance on insertions and deletions reaches 0.991 AUROC, a regime where many supervised methods degrade sharply. The system scales to approximately 4.2 million variants in total, including roughly 2 million variants of uncertain significance (VUS) for which no ground-truth labels exist and for which the disruption profiles constitute the primary clinical output. In a structured human evaluation, disruption profiles scored 3.8/5 for explanation quality against 2.8/5 for metadata-only baselines, a 36% relative gain (1.0/2.8 ≈ 0.357) in perceived explanatory value. The method extracts these profiles from Evo 2's internal representations via a sparse-feature-attribution pipeline developed jointly by Goodfire and Mayo Clinic. The paper argues this demonstrates that mechanistic interpretability applied to large biological sequence models can close the gap between black-box accuracy and clinician-usable reasoning, making genome-wide functional annotation tractable for direct clinical deployment on VUS that currently stall diagnostic workflows.
What to take away
- 1. Evee achieves 0.997 AUROC on pathogenicity classification across 839,000 ClinVar variants, surpassing all previously benchmarked computational methods on that dataset.
- 2. Zero-shot performance on insertions and deletions reaches 0.991 AUROC, suggesting that Evo 2's representations generalize to indels without task-specific fine-tuning.
- 3. The platform generates 'disruption profiles' — mechanistic, feature-level explanations derived from Evo 2's internal activations — rather than outputting only a pathogenicity probability.
- 4. Disruption profiles received a mean human-evaluation score of 3.8/5 for explanation quality, compared to 2.8/5 for metadata-only outputs, a 36% relative improvement in perceived explanatory adequacy.
- 5. Evee covers approximately 4.2 million genetic variants in total, of which roughly 2 million are variants of uncertain significance (VUS) — the class with the largest unmet clinical need.
- 6. The underlying model is Evo 2, a genomic foundation model trained on large-scale DNA sequence corpora and used here as a frozen encoder whose internals are probed via Goodfire's sparse-feature-attribution pipeline.
- 7. The work was produced in collaboration between Goodfire and Mayo Clinic (partnership announced September 2025), positioning the tool for direct integration into clinical genomic workflows.
- 8. An open question the work raises is whether disruption profiles derived from sequence-level model features capture biologically distinct mechanisms across variant classes (e.g., splice-site vs. missense), or whether the features conflate multiple causal pathways.
- 9. To replicate the explanation-quality evaluation, a researcher could present matched variant records (one with a disruption profile, one with metadata only) to clinical genomicists, collect ratings on a 1–5 Likert rubric, and analyze the paired ratings with a Wilcoxon signed-rank test, a design this study operationalizes only implicitly.
- 10. The public tool at evee.goodfire.ai exposes variant-level disruption profiles, enabling external validation against independent VUS cohorts or disease-specific registries not included in the ClinVar benchmark.
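The paired analysis suggested in the replication takeaway can be sketched as follows. This is a minimal, standard-library illustration under stated assumptions, not the study's own analysis code: the ratings are made up, the function name `wilcoxon_signed_rank` is hypothetical, and the p-value uses a normal approximation with a tie correction and no continuity correction. In practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped; tied |differences| receive average ranks.
    Returns (W_plus, W_minus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, averaging ranks within each tie group.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (min(w_plus, w_minus) - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, w_minus, p_value

# Hypothetical 1-5 Likert ratings: index i is the same variant in both lists.
profile_scores = [4, 4, 3, 5, 4, 3, 4, 4, 5, 3, 4, 3]
metadata_scores = [3, 3, 3, 4, 2, 3, 3, 2, 4, 3, 3, 2]

w_plus, w_minus, p = wilcoxon_signed_rank(profile_scores, metadata_scores)
print(f"W+ = {w_plus}, W- = {w_minus}, p = {p:.4f}")
```

Likert data produce many tied differences, which is why the tie correction matters here; with small rater pools an exact test would be preferable to the normal approximation.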
Peer brief — for seminar discussion
This paper introduces Evee, a variant-pathogenicity prediction and explanation system built on Evo 2, a large genomic foundation model, developed jointly by Goodfire and Mayo Clinic. Rather than treating pathogenicity scoring as a pure classification problem, Evee extracts sparse feature-attribution signals from Evo 2's internal representations to produce what the authors call disruption profiles — per-variant mechanistic explanations indicating which biological processes are predicted to be disrupted. The method introduced is this sparse-feature-attribution pipeline applied to a frozen Evo 2 encoder; an alternative the work could have used is gradient-based saliency attribution (e.g., integrated gradients over input nucleotide positions), which would have provided a natural baseline for assessing whether internal-feature explanations carry information beyond input-level sensitivity maps. The load-bearing finding is a 0.997 AUROC on 839,000 ClinVar variants — the largest published benchmark of its type — and 0.991 AUROC zero-shot on insertions and deletions, both reported as state-of-the-art. Coverage extends to roughly 4.2 million variants total, including approximately 2 million variants of uncertain significance for which disruption profiles are the primary output. A structured human evaluation found that disruption profiles scored 3.8/5 versus 2.8/5 for metadata-only conditions, a 36% relative gain, with raters drawn from a clinical genomics context consistent with the Mayo Clinic collaboration. The paper's implicit prediction is that mechanistic interpretability applied at genome scale can convert model internals into clinician-actionable reasoning — that accuracy and explainability are jointly achievable, not in tension. This would matter considerably for the diagnostic pipeline around VUS, which currently stall workups precisely because high-accuracy classifiers offer no mechanistic rationale. 
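For readers less familiar with the gradient-based alternative named above, integrated gradients can be sketched on a toy model. Everything here is an assumption for illustration: the scorer is a small logistic regression over one-hot nucleotides with arbitrary weights, not Evo 2, and the all-zero baseline is just one common choice of reference input.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Flatten a DNA string into a one-hot vector (4 channels per position)."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def toy_model(x, w, bias):
    """Stand-in scorer: logistic regression over the one-hot input."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)

def toy_gradient(x, w, bias):
    """Analytic gradient of toy_model with respect to each input element."""
    p = toy_model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, bias, steps=200):
    """Midpoint Riemann approximation of the integrated-gradients path integral."""
    grad_sum = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(toy_gradient(point, w, bias)):
            grad_sum[i] += g
    return [(xi - b) * gs / steps for xi, b, gs in zip(x, baseline, grad_sum)]

seq = "ACGTTGCA"  # arbitrary example sequence
x = one_hot(seq)
baseline = [0.0] * len(x)  # all-zero reference input (an assumption)
w = [0.3 * math.sin(i + 1) for i in range(len(x))]  # arbitrary toy weights
bias = -0.1

attributions = integrated_gradients(x, baseline, w, bias)
# Collapse the 4 channels into one attribution per nucleotide position.
per_position = [sum(attributions[4 * i: 4 * i + 4]) for i in range(len(seq))]

# Completeness check: attributions should sum to f(x) - f(baseline).
gap = sum(attributions) - (toy_model(x, w, bias) - toy_model(baseline, w, bias))
print(per_position, gap)
```

The output is an input-level sensitivity map over nucleotide positions, which is the kind of baseline against which internal-feature explanations would need to show added information.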
The contestable element a critical reader should press on is the explanation-quality evaluation design. A 3.8 vs. 2.8 score differential on a 1–5 Likert scale is suggestive, but the rater pool size, rater expertise level, and inter-rater reliability statistics are not prominently reported in the preprint. It is therefore unclear whether the improvement reflects genuine mechanistic informativeness or simply that any structured profile narrative scores higher than sparse metadata regardless of biological content. That would be a form of fluency bias, well documented in LLM evaluation and plausibly operative here when model-derived prose profiles are rated against terse metadata fields. Without an ablation showing that profiles generated from shuffled or randomly attributed features score lower than real profiles, the 3.8/5 figure cannot be cleanly attributed to the attribution method itself.
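One way to build the shuffled-feature control described above is to hand each variant the attributed features of a different variant, then have raters score real and control profiles blind. The sketch below covers only the reassignment step. The variant IDs, feature names, and the function `shuffled_control` are all hypothetical placeholders and are not taken from the paper or the Evee tool.

```python
import random

def shuffled_control(profiles, seed=0):
    """Map each variant ID to the feature list of a *different* variant."""
    ids = list(profiles)
    if len(ids) < 2:
        raise ValueError("need at least two variants to build a control")
    rng = random.Random(seed)
    rng.shuffle(ids)
    # Rotating the shuffled order by one guarantees that no variant
    # keeps its own features (the mapping is a derangement).
    donors = ids[1:] + ids[:1]
    return {vid: list(profiles[donor]) for vid, donor in zip(ids, donors)}

# Hypothetical attributed features per variant (placeholder names).
real = {
    "variant_A": ["splice_donor_loss", "exon_boundary"],
    "variant_B": ["missense_in_binding_domain"],
    "variant_C": ["promoter_motif_disruption", "tf_binding_site"],
    "variant_D": ["frameshift", "premature_stop"],
}

control = shuffled_control(real, seed=42)
for vid in sorted(real):
    print(vid, "real:", real[vid], "control:", control[vid])
```

If control profiles rendered from these mismatched features score about as well as real ones, the rating gain would more plausibly reflect fluency than mechanistic content.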
Methods (2)
- Disruption profiles: Mechanistic explanation outputs from EVEE showing how variants affect gene function, scored 3.8/5 for explanation quality.
- Evee variant effect prediction method: The method that predicts and explains variant pathogenicity using Evo 2, producing disruption profiles.
Frameworks (1)
- Evo-2: Genomic foundation model used to predict and explain variant pathogenicity.
Datasets (2)
- 4.2M whole-genome variants dataset: The set of 4.2 million genetic variants across the human genome for which Evee provides predictions and explanations.
- ClinVar: Curated database of 839k variants used to train and evaluate EVEE pathogenicity predictions.
Findings (4)
- Disruption profiles scored 3.8/5 for explanation quality vs 2.8/5 for metadata-only baselines
EVEE's mechanistic explanations were rated higher than metadata-only descriptions in human evaluation (3.8/5 vs 2.8/5).
- Mechanistic explanations provided for ~2M variants of uncertain significance
Scale of interpretability output, addressing a major clinical need for VUS resolution.
- 0.997 AUROC on pathogenicity prediction for 839k ClinVar variants
EVEE achieves state-of-the-art performance on variant pathogenicity classification, outperforming existing methods.
- 0.991 AUROC zero-shot on insertions/deletions
EVEE demonstrates strong generalization to indels without explicit training, consistent with the model having learned transferable sequence-level principles.
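Both headline findings are AUROC values, so it may help to recall what the metric measures: the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one, with ties counted as half. The sketch below computes this directly on made-up labels and scores. It is not ClinVar data, and the quadratic pairwise loop is for clarity rather than for use at benchmark scale.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for pathogenic, 0 for benign. Ties in score count as 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores for six variants.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.90, 0.80, 0.35, 0.40, 0.20, 0.10]

value = auroc(labels, scores)
print(f"AUROC = {value:.3f}")
```

Because AUROC depends only on how scores rank positives against negatives, it says nothing about calibration or about performance at a specific clinical decision threshold.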
Claims (6)
- EVEE provides mechanistic explanations for variant effects derived from model internals, not just pathogenicity calls.
Core interpretability claim distinguishing EVEE from black-box prediction tools; applies interpretability for science.
- Evee provides predictions and mechanistic explanations for 4.2 million genetic variants across the whole human genome
Scale claim, demonstrating whole-genome applicability.
- Model internals of genomic foundation models can yield mechanistic explanations for variant effects
Foundational interpretability claim that the paper exemplifies.
- Disruption profiles are higher quality explanations than metadata-only descriptions
Claim supported by the 3.8 vs 2.8 human rating finding.
- Interpretable predictions can help resolve variants of uncertain significance
Motivating claim that mechanistic explanations add clinical value for VUS.
- Evee outperforms existing methods for variant pathogenicity prediction
Interpretive claim supported by the high AUROC findings.