Genomic Co-occurrence Analysis: DNMT3A R882H / IDH2 R140Q / SETBP1 G870S / PTPN11 E76Q / EZH2 V662A in Myeloid Malignancies

Røine, Henrik G.

Methods & Reproducibility

Pipeline architecture, data provenance, statistical methods, computational tools

Pipeline Architecture

The analysis pipeline proceeds in thirteen sequential stages, each producing traceable intermediate outputs:

1. Data acquisition: GENIE v19.0 [AACR Project GENIE Consortium 2017] downloaded from Synapse under a signed Data Use Agreement (syn53210). Raw mutation, clinical, and panel data ingested via genie_loader.py.

2. Filtering: Restricted to myeloid OncoTree codes (AML, MDS, MPN, CMML, MDS/MPN, JMML) [Papaemmanuil et al. 2016]. Coding variants only: Intron, Silent, UTR, Flank, IGR, RNA, and Splice_Region excluded. Hypermutation filter removes samples with >20 coding mutations in the 34 target genes.

3. Panel adjustment: For each gene pair, only patients whose sequencing panel covers both genes are included in the denominator. This corrects for heterogeneous panel designs across 40+ contributing centers [AACR Project GENIE Consortium 2017].

4. Co-occurrence testing: Fisher's exact test (two-sided) applied to all 190 pairwise combinations of the top 20 mutated myeloid genes, following the approach of Kandoth et al. (2013). Produces observed/expected ratios and raw p-values.

5. Multiple testing correction: Benjamini-Hochberg FDR correction [Benjamini & Hochberg 1995] at α=0.05 across all 190 tests. 138 pairs remain significant after correction.

6. Cross-database validation: Quadruple co-occurrence searched across 10+ independent databases totaling ~31,000 deduplicated myeloid patients [Haferlach et al. 2014]. Zero matches found.

7. AI/ML scoring: ESM-2 [Lin et al. 2023] masked marginal log-likelihood ratios for variant pathogenicity. AlphaMissense [Cheng et al. 2023] for proteome-wide missense effect prediction. ESMFold for structural prediction. AutoDock Vina [Trott & Olson 2010] and DiffDock [Corso et al. 2023] for molecular docking. Boltz-1 [Wohlwend et al. 2024] for protein-ligand co-folding on A100 GPU.

8. Clinical annotation: OncoKB [Chakravarty et al. 2017] (v4 API) for clinical actionability levels. CIViC [Griffith et al. 2017] GraphQL for evidence-based variant annotation. ClinGen [Rehm et al. 2015] for gene-disease validity classification. PharmGKB [Whirl-Carrillo et al. 2021] for pharmacogenomic annotations and FDA drug labels.

9. Population frequency: gnomAD v4.1 [Chen et al. 2024] GraphQL queries for population allele frequencies across all five patient variants, confirming ultra-rare status.

10. Extended pathogenicity: EVE [Frazer et al. 2021] (evolutionary model of variant effect) for unsupervised pathogenicity scores. SpliceAI [Jaganathan et al. 2019] for cryptic splice site prediction. MaveDB [Esposito et al. 2019] deep mutational scanning data for DNMT3A functional evidence.

11. Network biology: STRING v12.0 [Szklarczyk et al. 2023] protein-protein interaction network for pathway connectivity. SynLethDB 2.0 [Wang et al. 2022] for synthetic lethality predictions. DepMap [Tsherniak et al. 2017] / GDSC portal for drug sensitivity in cell lines.

12. Clonal architecture: PyClone-VI [Gillis & Roth 2020] for clonal tree reconstruction from variant allele frequencies, modeling subclonal structure across the five mutations.

13. ACMG evidence aggregation: Bayesian point-based classification following the standards of Richards et al. (2015) and the revised quantitative framework of Tavtigian et al. (2020), integrating evidence from 14 sources (ESM-2, CADD, REVEL, AlphaMissense, EVE, SpliceAI, ClinVar, ClinGen, OncoKB, CIViC, gnomAD, MaveDB, conservation, functional studies) into final pathogenicity calls.
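The filtering rules in step 2 can be illustrated with a minimal, standard-library sketch. It is not the pipeline's actual code: the record layout (`sample_id`, `gene`, `variant_class`) and the function names are assumptions made for illustration, and the handling of the `3'`/`5'` prefixes assumes MAF-style variant classification labels.

```python
from collections import Counter

# Variant classes excluded in step 2, using the short names from the text.
EXCLUDED_CLASSES = {"Intron", "Silent", "UTR", "Flank", "IGR", "RNA", "Splice_Region"}
# Samples with more than this many coding mutations are treated as hypermutated.
MAX_CODING_MUTATIONS = 20


def is_coding(variant_class):
    """Return True if the variant class is not on the exclusion list."""
    # Assumption: labels such as 3'UTR or 5'Flank carry a two-character
    # prefix that is stripped before comparison with the short names.
    base = variant_class[2:] if variant_class[:2] in ("3'", "5'") else variant_class
    return base not in EXCLUDED_CLASSES


def filter_records(records, target_genes):
    """Keep coding variants in target genes, then drop hypermutated samples."""
    coding = [
        r for r in records
        if r["gene"] in target_genes and is_coding(r["variant_class"])
    ]
    per_sample = Counter(r["sample_id"] for r in coding)
    return [r for r in coding if per_sample[r["sample_id"]] <= MAX_CODING_MUTATIONS]
```

Counting is done after the coding filter, so silent or intronic calls do not push a sample over the hypermutation threshold.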

Data Sources

Source	Type	Patients or scope	Access
GENIE v19.0 [AACR Project GENIE Consortium 2017]	Sequencing panel	27,585 myeloid	Synapse (DUA)
GDC/TCGA	WES/WGS	16,411	Open access
cBioPortal	Aggregated	25,873	Open access
IPSS-M [Bernard et al. 2022]	Targeted panel	2,957	Published (Bernard 2022)
ICGC/PCAWG	WGS	1,575	DACO
ClinVar	Curated variants	N/A	Open access
Open Targets	Drug-gene	N/A	Open access
OncoKB [Chakravarty et al. 2017]	Clinical annotation	5 variants	API (requires token)
gnomAD v4 [Chen et al. 2024]	Population frequency	5 variants	Open access
MaveDB [Esposito et al. 2019]	Functional data	DNMT3A DMS	Open access

Source: AACR Project GENIE Consortium (2024). Synapse syn53210. Bernard et al., NEJM Evid 2022. GDC Data Portal v2. cBioPortal for Cancer Genomics. ICGC/PCAWG Consortium.

Computational Tools

Tool	Version	Purpose	Hardware
ESM-2 [Lin et al. 2023]	esm2_t33_650M_UR50D	Variant pathogenicity scoring	RTX 4060 (8GB)
ESMFold [Lin et al. 2023]	Via ESM-2	Structure prediction + contact maps	RTX 4060
AutoDock Vina [Trott & Olson 2010]	1.2.x	Molecular docking	CPU
DiffDock [Corso et al. 2023]	NVIDIA NIM API	ML-based docking	Cloud (NVIDIA)
Boltz-1 [Wohlwend et al. 2024]	Latest	Protein-ligand co-folding	Vast.ai A100 (40GB)
Gemini	2.5 Pro / 3.1 Pro	Literature synthesis + clinical interpretation	Google API
TxGemma	Flash	Drug-target predictions	Google API
OncoKB [Chakravarty et al. 2017]	v4 API	Clinical actionability annotation	API
CIViC [Griffith et al. 2017]	GraphQL API	Evidence-based variant annotation	API
ClinGen [Rehm et al. 2015]	API	Gene-disease validity classification	API
DGIdb [Freshour et al. 2021]	v5 API	Drug-gene interactions (76 found)	API
PharmGKB [Whirl-Carrillo et al. 2021]	API	Pharmacogenomic annotations + FDA labels	API
gnomAD [Chen et al. 2024]	v4.1 GraphQL	Population allele frequencies	API
EVE [Frazer et al. 2021]	Precomputed	Evolutionary variant pathogenicity	Downloaded scores
SpliceAI [Jaganathan et al. 2019]	Precomputed	Cryptic splice site prediction	Downloaded scores
AlphaMissense [Cheng et al. 2023]	Precomputed	Missense pathogenicity prediction	Downloaded scores
CADD [Rentzsch et al. 2019]	v1.7 (GRCh38)	Combined annotation-dependent deleteriousness	Precomputed scores
REVEL [Ioannidis et al. 2016]	Precomputed	Rare missense variant pathogenicity ensemble	Downloaded scores
PyClone-VI [Gillis & Roth 2020]	0.3+	Clonal tree reconstruction	CPU
DepMap/GDSC [Tsherniak et al. 2017]	Portal	Drug sensitivity in cell lines	Web API
SynLethDB [Wang et al. 2022]	2.0	Synthetic lethality predictions	API
MaveDB [Esposito et al. 2019]	API	Deep mutational scanning data (DNMT3A)	API
STRING [Szklarczyk et al. 2023]	v12.0	Protein-protein interactions	API
Python	3.12	Analysis pipeline	Local

Source: Lin et al. (2023) ESM-2. Corso et al. (2023) DiffDock. Wohlwend et al. (2024) Boltz-1. Trott & Olson (2010) AutoDock Vina. Google DeepMind (2024) Gemini, TxGemma. Rentzsch et al. (2019) CADD. Ioannidis et al. (2016) REVEL. Cheng et al. (2023) AlphaMissense. Chakravarty et al. (2017) OncoKB. Griffith et al. (2017) CIViC. Rehm et al. (2015) ClinGen. Freshour et al. (2021) DGIdb. Whirl-Carrillo et al. (2021) PharmGKB. Chen et al. (2024) gnomAD v4. Frazer et al. (2021) EVE. Jaganathan et al. (2019) SpliceAI. Gillis & Roth (2020) PyClone-VI. Szklarczyk et al. (2023) STRING v12. Tsherniak et al. (2017) DepMap. Wang et al. (2022) SynLethDB 2.0. Esposito et al. (2019) MaveDB.

Statistical Methods

Fisher's exact test (two-sided) is used for all co-occurrence testing [Kandoth et al. 2013], the appropriate test for 2×2 contingency tables with small expected cell counts, avoiding the asymptotic assumptions of chi-squared tests.
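For readers who want to see the mechanics of the test, here is a minimal standard-library sketch of a two-sided Fisher's exact test on one gene pair. The cell layout (a = both genes mutated, b = only gene A, c = only gene B, d = neither) and the function name are assumptions for illustration. The two-sided rule used here sums the probabilities of all tables no more likely than the observed one, which is a common convention; the pipeline may rely on a library routine such as scipy.stats.fisher_exact instead.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x co-mutated patients given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so ties are not lost to floating-point rounding.
    cutoff = p_observed * (1 + 1e-7)
    total = sum(p for p in (prob(x) for x in range(lowest, highest + 1)) if p <= cutoff)
    return min(1.0, total)
```

For example, `fisher_exact_two_sided(3, 1, 1, 3)` sums the probabilities of the tables with 0, 1, 3, and 4 co-mutated patients, giving 34/70.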

Benjamini-Hochberg FDR correction [Benjamini & Hochberg 1995] is applied at α=0.05 across all 190 gene-pair tests to control the false discovery rate. Of 190 pairs tested, 138 remain significant after correction.
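The Benjamini-Hochberg step-up procedure can be sketched in a few lines of standard-library Python. The function name and the choice to return adjusted p-values (rather than a reject/accept mask) are assumptions for illustration; a pair would then be called significant when its adjusted value is at or below 0.05.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(p_values, alpha=0.05):
    """Number of tests whose adjusted p-value is at or below alpha."""
    return sum(q <= alpha for q in benjamini_hochberg(p_values))
```

Applied to the 190 raw p-values from the pairwise tests, `count_significant` would yield the number of pairs retained after correction.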

Observed/Expected ratio is calculated with panel-adjusted denominators. For each gene pair, the expected co-occurrence frequency is the product of individual mutation frequencies among panel-eligible patients. Log2(O/E) is used for heatmap visualization to symmetrize enrichment and depletion.
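A minimal sketch of the panel-adjusted O/E calculation follows. The sample layout (`panel_genes` and `mutated_genes` as sets) and the function name are hypothetical; the logic restricts the denominator to patients whose panel covers both genes, then compares the observed co-mutation count with N × f_A × f_B.

```python
from math import log2


def observed_expected(samples, gene_a, gene_b):
    """Panel-adjusted observed/expected co-occurrence for one gene pair."""
    # Only patients whose panel covers both genes enter the denominator.
    eligible = [
        s for s in samples
        if gene_a in s["panel_genes"] and gene_b in s["panel_genes"]
    ]
    n = len(eligible)
    n_a = sum(gene_a in s["mutated_genes"] for s in eligible)
    n_b = sum(gene_b in s["mutated_genes"] for s in eligible)
    n_ab = sum(
        gene_a in s["mutated_genes"] and gene_b in s["mutated_genes"]
        for s in eligible
    )
    if n == 0 or n_a == 0 or n_b == 0:
        return None  # ratio undefined without eligible patients or mutations
    expected = n_a * n_b / n  # N * f_A * f_B under independence
    ratio = n_ab / expected
    log2_ratio = log2(ratio) if ratio > 0 else float("-inf")
    return {"n": n, "observed": n_ab, "expected": expected,
            "ratio": ratio, "log2_ratio": log2_ratio}
```

On the log2 scale a ratio of 2 and a ratio of 0.5 sit at +1 and -1, which is why the heatmap treats enrichment and depletion symmetrically.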

Mutual exclusivity testing uses a maximum entropy model [Canisius et al. 2016] to distinguish biological mutual exclusivity from chance co-occurrence patterns, accounting for the marginal mutation frequencies of each gene.

Quadruple expected frequency is calculated under a statistical independence model: the product of the four individual gene mutation frequencies yields an expected frequency of ~0.000113 per patient, or ~0.004 expected matches in 31,000 patients. This is consistent with the observed zero but does not account for potential biological dependencies between the mutations.
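The independence calculation reduces to a product of marginal frequencies multiplied by the cohort size. The sketch below shows that arithmetic with placeholder frequencies that are not the study's values; the Poisson estimate of the probability of seeing zero carriers is an added illustration, not something stated in the methods.

```python
from math import exp, prod


def expected_joint(frequencies, n_patients):
    """Expected joint frequency and carrier count under independence."""
    p_joint = prod(frequencies)            # product of per-gene frequencies
    expected_count = p_joint * n_patients  # expected carriers in the cohort
    p_zero = exp(-expected_count)          # Poisson approximation of P(0 carriers)
    return p_joint, expected_count, p_zero


# Hypothetical per-gene mutation frequencies, for illustration only.
example_frequencies = [0.1, 0.05, 0.02, 0.01]
p_joint, expected_count, p_zero = expected_joint(example_frequencies, 10_000)
```

When the expected count is far below 1, the probability of observing zero carriers is close to 1, so an empty search result cannot by itself distinguish independence from biological incompatibility.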

ACMG variant classification follows the standards of Richards et al. (2015) with the quantitative Bayesian framework of Tavtigian et al. (2020). Evidence codes (PP3, PM1, PM2, PS3, etc.) are aggregated from 14 computational and clinical sources into a final pathogenicity call.
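As a rough illustration of point-based aggregation, the sketch below uses the point values (supporting 1, moderate 2, strong 4, very strong 8, with benign evidence counted negatively) and the classification thresholds described by Tavtigian et al. (2020). The evidence tuples and function name are hypothetical, and how the pipeline maps each of its 14 sources to a code and strength is not shown here.

```python
# Points per evidence strength in the Tavtigian et al. (2020) system.
POINTS = {"supporting": 1, "moderate": 2, "strong": 4, "very_strong": 8}


def classify(evidence):
    """Sum evidence points and map the total to a five-tier class.

    evidence: iterable of (code, strength, direction) tuples, where
    direction is "pathogenic" or "benign".
    """
    total = 0
    for _code, strength, direction in evidence:
        sign = 1 if direction == "pathogenic" else -1
        total += sign * POINTS[strength]
    if total >= 10:
        label = "Pathogenic"
    elif total >= 6:
        label = "Likely pathogenic"
    elif total >= 0:
        label = "Uncertain significance"
    elif total >= -6:
        label = "Likely benign"
    else:
        label = "Benign"
    return total, label


# Hypothetical evidence set for a single variant, for illustration only.
example = [
    ("PS3", "strong", "pathogenic"),
    ("PM1", "moderate", "pathogenic"),
    ("PM2", "moderate", "pathogenic"),
    ("PP3", "supporting", "pathogenic"),
]
```

With this example the total is 9 points, which falls in the likely pathogenic band.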

Source: Fisher (1922). Benjamini & Hochberg (1995). Panel adjustment methodology follows AACR GENIE recommendations.

Reproducibility

All analysis scripts are available in the project repository. The 13-step pipeline spans 25+ computational tools and 31 AI research scripts. The master pipeline (mutation_profile/scripts/run_all.py) orchestrates execution with automated verification against known reference values at each stage.

Raw GENIE data requires a signed Synapse Data Use Agreement [AACR Project GENIE Consortium 2017] (syn53210). Cross-database queries use public APIs (GDC, cBioPortal, ClinVar via NCBI E-utilities, Open Targets GraphQL, OncoKB, CIViC, ClinGen, DGIdb, PharmGKB, gnomAD, STRING, SynLethDB, MaveDB). ICGC/PCAWG requires DACO access. OncoKB requires an API token.

Software requirements: Python 3.12, pandas, scipy, plotly, torch, fair-esm, openbabel, meeko, vina, requests, pyclone-vi, networkx. GPU workloads (ESM-2, ESMFold) require CUDA-capable GPU with ≥8 GB VRAM. Boltz-1 requires ≥40 GB VRAM (A100).

Every number in this portal is traceable to a specific script and data file. The verification script (mutation_profile/scripts/verify_results.py) checks all key metrics against hardcoded reference values and fails loudly on any regression.
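The idea of checking metrics against fixed reference values can be sketched as follows. This is not the content of verify_results.py: the metric names and function are hypothetical, and only the two reference numbers (190 pairs tested, 138 significant) come from the methods described here.

```python
# Reference values taken from the methods text; the key names are hypothetical.
REFERENCE_VALUES = {
    "gene_pairs_tested": 190,
    "significant_pairs_fdr": 138,
}


def verify_metrics(computed, reference=REFERENCE_VALUES):
    """Raise AssertionError listing every metric that is missing or differs."""
    problems = []
    for name, expected in reference.items():
        if name not in computed:
            problems.append(f"{name}: missing (expected {expected})")
        elif computed[name] != expected:
            problems.append(f"{name}: got {computed[name]}, expected {expected}")
    if problems:
        raise AssertionError("Regression detected:\n" + "\n".join(problems))
    return True
```

Collecting all mismatches before raising means a single run reports every regression at once instead of stopping at the first.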

Total pipeline runtime: ~5 minutes on a local workstation (excluding Boltz-1 GPU runs on Vast.ai, which add ~15 minutes per protein-ligand pair on an A100).

Source: Full source code: mutation_profile/scripts/ (44 core + 31 AI research). Verification: mutation_profile/scripts/verify_results.py.

Data Downloads

All data files, methodology documentation, and citation references are available for download.

@misc{roine2026_mutation_profile,
  author       = {Røine, Henrik G.},
  title        = {Quintuple Somatic Driver Mutation Profile in MDS-AML:
                  Co-occurrence Analysis Across 31,000+ Myeloid Patients},
  year         = {2026},
  note         = {N=1 case study. AACR Project GENIE v19.0,
                  cBioPortal, GDC, ICGC, ClinVar. Zero matches
                  across 10+ databases. Pairwise-corrected
                  expected frequency: 1 in 1.9 billion.},
  howpublished = {Genomics Portal},
}

TY  - DATA
AU  - Røine, Henrik G.
TI  - Quintuple Somatic Driver Mutation Profile in MDS-AML: Co-occurrence Analysis Across 31,000+ Myeloid Patients
PY  - 2026
N1  - N=1 case study. AACR Project GENIE v19.0, cBioPortal, GDC, ICGC, ClinVar. Zero matches across 10+ databases. Pairwise-corrected expected frequency: 1 in 1.9 billion.
DB  - Genomics Portal
ER  -

References

AACR Project GENIE Consortium. AACR Project GENIE: Powering Precision Medicine through an International Consortium. Cancer Discov (2017).
Papaemmanuil E, Gerstung M, Bullinger L, et al. Genomic Classification and Prognosis in Acute Myeloid Leukemia. N Engl J Med (2016).
Kandoth C, McLellan MD, Vandin F, et al. Mutational landscape and significance across 12 major cancer types. Nature (2013).
Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc Series B (1995).
Haferlach T, Nagata Y, Grossmann V, et al. Landscape of genetic lesions in 944 patients with myelodysplastic syndromes. Leukemia (2014).
Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science (2023).
Cheng J, Novati G, Pan J, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science (2023).
Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem (2010).
Corso G, Staerk H, Jing B, Barzilay R, Jaakkola T. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. ICLR (2023).
Wohlwend J, Corso G, Passaro S, et al. Boltz-1: Democratizing Biomolecular Interaction Modeling. bioRxiv (2024).
Chakravarty D, Gao J, Phillips SM, et al. OncoKB: A Precision Oncology Knowledge Base. JCO Precis Oncol (2017).
Griffith M, Spies NC, Krysiak K, et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat Genet (2017).
Rehm HL, Berg JS, Brooks LD, et al. ClinGen: The Clinical Genome Resource. N Engl J Med (2015).
Freshour SL, Kiwala S, Cotto KC, et al. Integration of the Drug-Gene Interaction Database (DGIdb 4.0) with open crowdsource efforts. Nucleic Acids Res (2021).
Whirl-Carrillo M, Huddart R, Gong L, et al. An Evidence-Based Framework for Evaluating Pharmacogenomics Knowledge for Personalized Medicine. Clin Pharmacol Ther (2021).
Chen S, Francioli LC, Goodrich JK, et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature (2024).
Frazer J, Notin P, Dias M, et al. Disease variant prediction with deep generative models of evolutionary data. Nature (2021).
Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell (2019).
Esposito D, Weile J, Shendure J, et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol (2019).
Szklarczyk D, Kirsch R, Koutrouli M, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res (2023).
Wang J, Wu M, Huang X, et al. SynLethDB 2.0: A web-based knowledge graph database on synthetic lethality for novel anticancer drug discovery. Database (2022).
Tsherniak A, Vazquez F, Montgomery PG, et al. Defining a Cancer Dependency Map. Cell (2017).
Gillis S, Roth A. PyClone-VI: scalable inference of clonal population structures using whole genome data. BMC Bioinformatics (2020).
Richards S, Aziz N, Bale S, et al. Standards and guidelines for the interpretation of sequence variants. Genet Med (2015).
Tavtigian SV, Harrison SM, Boucher KM, Biesecker LG. Fitting a naturally scaled point system to the ACMG/AMP variant classification guidelines. Hum Mutat (2020).
Canisius S, Martens JW, Wessels LF. A novel independence test for somatic alterations in cancer shows that biology drives mutual exclusivity but chance explains most co-occurrence. Genome Biol (2016).
Bernard E, Tuechler H, Greenberg PL, et al. Molecular International Prognostic Scoring System for Myelodysplastic Syndromes. NEJM Evid (2022).
Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res (2019).
Ioannidis NM, Rothstein JH, Pejaver V, et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet (2016).
