paper:doi-10-64898-2026-04-10-717844
EVEE: Interpretable variant effect prediction from genomic foundation model embeddings
Abstract
Predicting the clinical significance of genetic variants remains a central challenge in genomic medicine, with most observed variants classified as variants of uncertain significance. Here we show that representations from Evo 2, a 7-billion-parameter genomic foundation model, support accurate and interpretable pathogenicity prediction across variant types from a single framework. An embedding-based classifier, or “probe”, trained on Evo 2 embeddings achieves state-of-the-art performance across single nucleotide variant consequence types (0.997 overall AUROC on 833k ClinVar variants) and generalizes zero-shot to indels (0.991 AUROC), outperforming bioinformatic meta-predictors, protein models, and existing foundation model approaches. Performance is robust across conservation levels and transfers to deep mutational scanning datasets for BRCA1, BRCA2, TP53, and LDLR. To make these predictions interpretable, we train supervised annotation probes to quantify predicted disruptions caused by each variant, then synthesize these disruption profiles into natural language explanations using a frontier reasoning model. We provide pre-computed predictions and on-demand explanations for all 4.2 million ClinVar variants through the Evo Variant Effect Explorer (EVEE), an interactive web resource for the community. This work establishes that representations from genomic foundation models can serve as a unified substrate for both accurate variant effect prediction and mechanistic interpretation, reframing interpretability in computational genomics from a trade-off into a complementary product of learned biological structure.
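The abstract describes an embedding-based classifier ("probe") evaluated by AUROC but does not specify its architecture or training procedure. The following is a minimal sketch under stated assumptions: a logistic-regression probe fitted by stochastic gradient descent, with synthetic Gaussian vectors standing in for Evo 2 embeddings. All function names, dimensions, and hyperparameters here are hypothetical and illustrative only; this is not the authors' implementation.

```python
import math
import random


def make_synthetic_embeddings(n, dim, rng):
    """Toy stand-in for variant embeddings (assumption: NOT real Evo 2 output).

    Rows with label 1 are shifted by +1.0 in every dimension.
    """
    data = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) + (1.0 if label else 0.0) for _ in range(dim)]
        data.append((vec, label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, dim, lr=0.1, epochs=50):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)
            g = p - label  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, vec)]
            b -= lr * g
    return w, b


def score(w, b, vec):
    """Probe output: predicted probability of the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
dim = 8  # hypothetical; real foundation-model embeddings are far wider
data = make_synthetic_embeddings(400, dim, rng)
train, test = data[:300], data[300:]

w, b = train_linear_probe(train, dim)
test_scores = [score(w, b, v) for v, _ in test]
test_labels = [y for _, y in test]
probe_auroc = auroc(test_scores, test_labels)
print(f"held-out AUROC: {probe_auroc:.3f}")
```

The pairwise AUROC above is quadratic in the number of examples, which is fine for a toy set; at the scale of hundreds of thousands of variants, a rank-based computation would be the more practical choice.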
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.
- Explaining 4.2 million genetic variants with state-of-the-art, interpretable predictions · in corpus · 2026 · ≈ 93%
- Unveiling interpretable development-specific gene signatures in the developing human prefrontal cortex with ICGS · Xiucai Ye (1 and 2), Tetsuya Sakurai (1 and 2) ((1) University of Tsukuba, (2) Center for Artificial Intelligence Research in University of Tsukuba), Meng Huang (1) · 2022 · ≈ 81%
- EVA: Towards a universal model of the immune system · Vincent Bouget, Apolline Bruley, Yannis Cattan, Charlotte Claye, Matthew Corney, Julien Duquesne, Karim El Kanbi, Aziz Fouché, Pierre Marschall, Francesco Strozzi, Scienta Team: Ethan Bandasack · 2026 · ≈ 81%
- Entropy, Disagreement, and the Limits of Foundation Models in Genomics · Lovro Vrček, Mile Šikić, Maxime Rochkoulets · 2026 · ≈ 80%
- BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance · Mehrdad Shoeibi, Ivan Garibay, Niloofar Yousefi, Elias Hossain · 2026 · ≈ 80%
- GenoBERT: A Language Model for Accurate Genotype Imputation · Chuan Qiu, Kuan-Jui Su, Anqi Liu, Yun Gong, Weiqiang Lin, Lindong Jiang, Chen Zhao, Meng Song, Jeffrey Deng, Qing Tian, Zhe Luo, Ping Gong, Hui Shen, Chaoyang Zhang, Hong-Wen Deng, Lei Huang · 2026 · ≈ 80%
- Evaluating Post-hoc Explanations of the Transformer-based Genome Language Model DNABERT-2 · Paulo Yanez Sarmiento, Bernhard Y. Renard, Isabel Kurth · 2026 · ≈ 80%
- Learning biologically relevant features in a pathology foundation model using sparse autoencoders · Ciyue Shen, Neel Patel, Chintan Shah, Darpan Sanghavi, Blake Martin, Alfred Eng, Daniel Shenker, Harshith Padigela, Raymond Biju, Syed Ashar Javed, Jennifer Hipp, John Abel, Harsha Pokkalla, Sean Grullon, Dinkar Juyal, Nhat Minh Le · 2024 · ≈ 79%
- Discovery of Disease Relationships via Transcriptomic Signature Analysis Powered by Agentic AI · Ke Chen and Haohan Wang · 2025 · ≈ 79%
- Ultrafast topological data analysis reveals pandemic-scale dynamics of convergent evolution · Lukas Hahn, Maximilian Neumann, Zachary Ardern, Juan Angel Patino-Galindo, Mathieu Carriere, Ulrich Bauer, Raul Rabadan, Andreas Ott, Michael Bleher · 2026 · ≈ 79%
- MS-ConTab: Multi-Scale Contrastive Learning of Mutation Signatures for Pan Cancer Representation and Stratification · Adam Khadre, Ruben C Petreaca, Golrokh Mirzaei, Yifan Dou · 2025 · ≈ 78%
- Covariance-based Sequence Pooling · in corpus · 2026 · ≈ 78%
- A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders · Rajiv Misra, Sanjay Kumar Singh, Anisha Roy, Dip Roy · 2026 · ≈ 78%
- Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response · Karen Rafferty, Hui Wang, Christopher Baker · 2026 · ≈ 78%
- Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity, Task Specialisation, and Mortality Prediction · Feng Dong, Andreas Karwath, Chris Sainsbury · 2026 · ≈ 78%
- When AI Does Science: Evaluating the Autonomous AI Scientist KOSMOS in Radiation Biology · Humza Nusrat and Omar Nusrat · 2025 · ≈ 78%
- Revisiting Gene Ontology Knowledge Discovery with Hierarchical Feature Selection and Virtual Study Group of AI Agents · Cen Wan and Alex A. Freitas · 2026 · ≈ 78%
- Emergence and Causality in Complex Systems: A Survey on Causal Emergence and Related Quantitative Studies · in corpus · 2023 · ≈ 76%
- Anima Labs Phenomenology Pt1 · in corpus · ≈ 75%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets · in corpus · 2023 · ≈ 75%
- Active Inference, Curiosity and Insight · in corpus · 2017 · ≈ 75%
- Model Alignment Search · in corpus · 2025 · ≈ 74%