AACR Project GENIE v19.0 · 21,017 myeloid patients · Panel-adjusted Fisher's exact with Benjamini-Hochberg FDR · N=1 case study · Not clinical guidance

Methods & Reproducibility

Pipeline architecture, data provenance, statistical methods, computational tools

Pipeline Architecture

The analysis pipeline proceeds in thirteen sequential stages, each producing traceable intermediate outputs:
1. Data acquisition: GENIE v19.0 (AACR Project GENIE Consortium, 2017 [17]) was downloaded from Synapse under a signed Data Use Agreement (syn53210). Raw mutation, clinical, and panel data are ingested via genie_loader.py.
2. Filtering: Samples are restricted to myeloid OncoTree codes (AML, MDS, MPN, CMML, MDS/MPN, JMML) (Papaemmanuil et al., 2016 [12]). Only coding variants are kept: Intron, Silent, UTR, Flank, IGR, RNA, and Splice_Region classifications are excluded. A hypermutation filter removes samples with >20 coding mutations in the 34 target genes.
3. Panel adjustment: For each gene pair, only patients whose sequencing panel covers both genes are included in the denominator. This corrects for heterogeneous panel designs across 40+ contributing centers (AACR Project GENIE Consortium, 2017 [17]).
4. Co-occurrence testing: Fisher's exact test (two-sided) is applied to all 190 pairwise combinations of the top 20 mutated myeloid genes, following the approach of Kandoth et al. (2013) [31]. This step produces observed/expected ratios and raw p-values.
5. Multiple testing correction: Benjamini-Hochberg FDR correction (Benjamini & Hochberg, 1995 [24]) is applied at α=0.05 across all 190 tests. 138 pairs remain significant after correction.
6. Cross-database validation: The quadruple co-occurrence was searched across 10+ independent databases totaling ~31,000 deduplicated myeloid patients (Haferlach et al., 2014 [48]). Zero matches were found.
7. AI/ML scoring: ESM-2 (Lin et al., 2023 [18]) masked marginal log-likelihood ratios for variant pathogenicity; AlphaMissense (Cheng et al., 2023 [19]) for proteome-wide missense effect prediction; ESMFold for structural prediction; AutoDock Vina (Trott & Olson, 2010 [32]) and DiffDock (Corso et al., 2023 [33]) for molecular docking; and Boltz-1 (Wohlwend et al., 2024 [34]) for protein-ligand co-folding on an A100 GPU.
8. Clinical annotation: OncoKB (Chakravarty et al., 2017 [35]; v4 API) for clinical actionability levels; CIViC (Griffith et al., 2017 [36]; GraphQL API) for evidence-based variant annotation; ClinGen (Rehm et al., 2015 [37]) for gene-disease validity classification; and PharmGKB (Whirl-Carrillo et al., 2021 [39]) for pharmacogenomic annotations and FDA drug labels.
9. Population frequency: gnomAD v4.1 (Chen et al., 2024 [40]) GraphQL queries return population allele frequencies for all five patient variants, confirming ultra-rare status.
10. Extended pathogenicity: EVE (evolutionary model of variant effect; Frazer et al., 2021 [41]) for unsupervised pathogenicity scores; SpliceAI (Jaganathan et al., 2019 [42]) for cryptic splice site prediction; and MaveDB (Esposito et al., 2019 [43]) deep mutational scanning data for DNMT3A functional evidence.
11. Network biology: STRING v12.0 (Szklarczyk et al., 2023 [44]) protein-protein interaction network for pathway connectivity; SynLethDB 2.0 (Wang et al., 2022 [45]) for synthetic lethality predictions; and the DepMap (Tsherniak et al., 2017 [46]) / GDSC portals for drug sensitivity in cell lines.
12. Clonal architecture: PyClone-VI (Gillis & Roth, 2020 [16]) for clonal tree reconstruction from variant allele frequencies, modeling subclonal structure across the five mutations.
13. ACMG evidence aggregation: Bayesian point-based classification following the standards of Richards et al. (2015) [25] and the revised quantitative framework of Tavtigian et al. (2020) [26], integrating evidence from 14 sources (ESM-2, CADD, REVEL, AlphaMissense, EVE, SpliceAI, ClinVar, ClinGen, OncoKB, CIViC, gnomAD, MaveDB, conservation, functional studies) into final pathogenicity calls.
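
The panel adjustment in step 3 can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name, the in-memory data layout, and the toy panels and samples are all assumptions made for the example. It builds the 2×2 table for one gene pair from only those samples whose panel covers both genes, and confirms that 20 genes yield 190 unordered pairs.

```python
from itertools import combinations


def panel_adjusted_table(samples, panels, gene_a, gene_b):
    """Return (both, a_only, b_only, neither) for one gene pair.

    samples: iterable of (panel_id, set of mutated genes)   [hypothetical layout]
    panels:  dict mapping panel_id -> set of genes covered  [hypothetical layout]
    Samples whose panel does not cover BOTH genes are skipped, so they
    never enter the denominator for this pair.
    """
    both = a_only = b_only = neither = 0
    for panel_id, mutated in samples:
        covered = panels[panel_id]
        if gene_a not in covered or gene_b not in covered:
            continue  # not panel-eligible for this pair
        has_a = gene_a in mutated
        has_b = gene_b in mutated
        if has_a and has_b:
            both += 1
        elif has_a:
            a_only += 1
        elif has_b:
            b_only += 1
        else:
            neither += 1
    return both, a_only, b_only, neither


# Toy data (illustrative only, not GENIE values).
PANELS = {
    "P1": {"DNMT3A", "TET2", "NPM1"},
    "P2": {"DNMT3A", "NPM1"},  # this panel does not cover TET2
}
SAMPLES = [
    ("P1", {"DNMT3A", "TET2"}),
    ("P1", {"DNMT3A"}),
    ("P1", set()),
    ("P2", {"DNMT3A"}),
    ("P2", {"NPM1"}),
]

# Number of unordered pairs among 20 genes.
N_PAIRS = sum(1 for _ in combinations(range(20), 2))

if __name__ == "__main__":
    print(panel_adjusted_table(SAMPLES, PANELS, "DNMT3A", "TET2"))
    print(panel_adjusted_table(SAMPLES, PANELS, "DNMT3A", "NPM1"))
    print(N_PAIRS)
```

In the toy data, the two P2 samples are excluded from the DNMT3A/TET2 table because P2 does not sequence TET2, while all five samples count toward DNMT3A/NPM1.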

Data Sources

Source | Type | Patients | Access
GENIE v19.0 [17] | Sequencing panel | 27,585 myeloid | Synapse (DUA)
GDC/TCGA | WES/WGS | 16,411 | Open access
cBioPortal | Aggregated | 25,873 | Open access
IPSS-M [13] | Targeted panel | 2,957 | Published (Bernard 2022)
ICGC/PCAWG | WGS | 1,575 | DACO
ClinVar | Curated variants | N/A | Open access
Open Targets | Drug-gene | N/A | Open access
OncoKB [35] | Clinical annotation | 5 variants | API (requires token)
gnomAD v4 [40] | Population frequency | 5 variants | Open access
MaveDB [43] | Functional data | DNMT3A DMS | Open access
Source: AACR Project GENIE Consortium (2024). Synapse syn53210. Bernard et al., NEJM Evid 2022. GDC Data Portal v2. cBioPortal for Cancer Genomics. ICGC/PCAWG Consortium.

Computational Tools

Tool | Version | Purpose | Hardware
ESM-2 [18] | esm2_t33_650M_UR50D | Variant pathogenicity scoring | RTX 4060 (8 GB)
ESMFold [18] | Via ESM-2 | Structure prediction + contact maps | RTX 4060
AutoDock Vina [32] | 1.2.x | Molecular docking | CPU
DiffDock [33] | NVIDIA NIM API | ML-based docking | Cloud (NVIDIA)
Boltz-1 [34] | Latest | Protein-ligand co-folding | Vast.ai A100 (40 GB)
Gemini | 2.5 Pro / 3.1 Pro | Literature synthesis + clinical interpretation | Google API
TxGemma | Flash | Drug-target predictions | Google API
OncoKB [35] | v4 API | Clinical actionability annotation | API
CIViC [36] | GraphQL API | Evidence-based variant annotation | API
ClinGen [37] | API | Gene-disease validity classification | API
DGIdb [38] | v5 API | Drug-gene interactions (76 found) | API
PharmGKB [39] | API | Pharmacogenomic annotations + FDA labels | API
gnomAD [40] | v4.1 GraphQL | Population allele frequencies | API
EVE [41] | Precomputed | Evolutionary variant pathogenicity | Downloaded scores
SpliceAI [42] | Precomputed | Cryptic splice site prediction | Downloaded scores
AlphaMissense [19] | Precomputed | Missense pathogenicity prediction | Downloaded scores
CADD [27] | v1.7 (GRCh38) | Combined Annotation Dependent Depletion (deleteriousness) scores | Precomputed scores
REVEL [28] | Precomputed | Rare missense variant pathogenicity ensemble | Downloaded scores
PyClone-VI [16] | 0.3+ | Clonal tree reconstruction | CPU
DepMap/GDSC [46] | Portal | Drug sensitivity in cell lines | Web API
SynLethDB [45] | 2.0 | Synthetic lethality predictions | API
MaveDB [43] | API | Deep mutational scanning data (DNMT3A) | API
STRING [44] | v12.0 | Protein-protein interactions | API
Python | 3.12 | Analysis pipeline | Local
Source: Lin et al. (2023) ESM-2. Corso et al. (2023) DiffDock. Wohlwend et al. (2024) Boltz-1. Trott & Olson (2010) AutoDock Vina. Google DeepMind (2024) Gemini, TxGemma. Rentzsch et al. (2019) CADD. Ioannidis et al. (2016) REVEL. Cheng et al. (2023) AlphaMissense. Chakravarty et al. (2017) OncoKB. Griffith et al. (2017) CIViC. Rehm et al. (2015) ClinGen. Freshour et al. (2021) DGIdb. Whirl-Carrillo et al. (2021) PharmGKB. Chen et al. (2024) gnomAD v4. Frazer et al. (2021) EVE. Jaganathan et al. (2019) SpliceAI. Gillis & Roth (2020) PyClone-VI. Szklarczyk et al. (2023) STRING v12. Tsherniak et al. (2017) DepMap. Wang et al. (2022) SynLethDB 2.0. Esposito et al. (2019) MaveDB.

Statistical Methods

Fisher's exact test (two-sided) is used for all co-occurrence testing (Kandoth et al., 2013 [31]). It computes exact probabilities for 2×2 contingency tables, so it remains valid when expected cell counts are small and the asymptotic approximation behind the chi-squared test is unreliable.
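
As an illustration of the test itself, the sketch below computes a two-sided Fisher's exact p-value from the hypergeometric distribution using only the standard library. It is a reference sketch under one common two-sided convention (sum the probabilities of all tables no more likely than the observed one); the function name is hypothetical. The requirements list includes scipy, so the pipeline may well call a library routine rather than code like this.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    With margins fixed, the top-left cell follows a hypergeometric
    distribution. The p-value sums the probability of every table that
    is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so ties are not lost to rounding.
    cutoff = p_observed * (1 + 1e-7)
    p_value = sum(p for p in (prob(x) for x in range(lowest, highest + 1)) if p <= cutoff)
    return min(1.0, p_value)


if __name__ == "__main__":
    # Rows: gene A mutated / wild-type; columns: gene B mutated / wild-type.
    print(fisher_exact_two_sided(8, 2, 1, 5))
```

For a co-occurrence table, `a` is the count of patients with both genes mutated, `b` and `c` are the single-gene counts, and `d` is the count with neither, all restricted to panel-eligible patients.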
Benjamini-Hochberg FDR correction (Benjamini & Hochberg, 1995 [24]) is applied at α=0.05 across all 190 gene-pair tests to control the false discovery rate. Of 190 pairs tested, 138 remain significant after correction.
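
The Benjamini-Hochberg step-up procedure can be sketched in a few lines. This is a generic textbook implementation with a hypothetical function name and toy p-values, not the pipeline's own code.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return (adjusted p-values, rejection flags) in the input order.

    The adjusted value for the p-value of rank r (1-based, ascending) is
    min over ranks k >= r of p_(k) * m / k, where m is the number of tests.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected


if __name__ == "__main__":
    # Toy p-values (illustrative only).
    adjusted, rejected = benjamini_hochberg([0.001, 0.2, 0.03, 0.9])
    print(adjusted, rejected)
```

In the toy run only the first test survives: 0.03 is below 0.05 before correction, but its adjusted value is 0.06.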
Observed/Expected ratio is calculated with panel-adjusted denominators. For each gene pair, the expected co-occurrence frequency is the product of individual mutation frequencies among panel-eligible patients. Log2(O/E) is used for heatmap visualization to symmetrize enrichment and depletion.
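
A minimal sketch of the observed/expected calculation, assuming the 2×2 counts have already been restricted to panel-eligible patients. The function name and the toy counts are hypothetical.

```python
from math import log2


def observed_expected(both, a_only, b_only, neither):
    """Return (O/E ratio, log2(O/E)) for one gene pair.

    The expected co-occurrence count is n * freq_a * freq_b, written
    here as count_a * count_b / n to avoid an extra rounding step.
    """
    n = both + a_only + b_only + neither
    count_a = both + a_only
    count_b = both + b_only
    expected = count_a * count_b / n
    if expected == 0:
        raise ValueError("expected count is zero: a gene has no mutations in this subset")
    ratio = both / expected
    # log2 puts 2-fold enrichment (+1) and 2-fold depletion (-1) on a symmetric scale.
    log_ratio = log2(ratio) if ratio > 0 else float("-inf")
    return ratio, log_ratio


if __name__ == "__main__":
    # Toy counts: 1,000 eligible patients, 50 with gene A, 40 with gene B, 10 with both.
    print(observed_expected(10, 40, 30, 920))
```

With the toy counts the expected overlap is 2 patients, so 10 observed gives an O/E of 5.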
Mutual exclusivity testing uses a maximum entropy model (Canisius et al., 2016 [30]) to distinguish biological mutual exclusivity from chance co-occurrence patterns, accounting for the marginal mutation frequencies of each gene.
Quadruple expected frequency is calculated under a statistical independence model: the product of the four individual gene mutation frequencies yields an expected frequency of ~0.000113 per patient, or ~0.004 expected matches in 31,000 patients. This is consistent with the observed zero but does not account for potential biological dependencies between the mutations.
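
The independence model reduces to two multiplications: the joint frequency is the product of the per-gene frequencies, and the expected number of matches is that joint frequency times the cohort size. The sketch below uses placeholder frequencies, not the GENIE values. The Poisson estimate of the chance of seeing zero matches is an added illustration, not something the analysis above reports.

```python
from math import exp, prod


def expected_joint(frequencies, cohort_size):
    """Return (joint frequency, expected matches, P(zero matches)).

    Assumes the mutations are statistically independent, so the joint
    frequency is the product of the individual frequencies. P(zero) uses
    a Poisson approximation, exp(-expected), which is an illustrative
    addition here.
    """
    joint = prod(frequencies)
    expected = joint * cohort_size
    p_zero = exp(-expected)
    return joint, expected, p_zero


if __name__ == "__main__":
    # Placeholder per-gene mutation frequencies (hypothetical values).
    joint, expected, p_zero = expected_joint([0.10, 0.10, 0.10, 0.10], 10_000)
    print(joint, expected, p_zero)
```

Because the model ignores biological dependencies, positively correlated mutations would make the true expected count higher than this product, and mutually exclusive ones would make it lower.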
ACMG variant classification follows the standards of Richards et al. (2015) [25] with the quantitative Bayesian framework of Tavtigian et al. (2020) [26]. Evidence codes (PP3, PM1, PM2, PS3, etc.) are aggregated from 14 computational and clinical sources into a final pathogenicity call.
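
The point system of Tavtigian et al. (2020) can be sketched as below. Each evidence code contributes 1, 2, 4, or 8 points by strength (negative for benign evidence), and the total maps to a class. The sketch assumes every code is applied at its default strength from Richards et al. (2015) and does not handle the stand-alone benign code BA1. How the pipeline maps its 14 sources onto codes and strengths is not shown here, and the function names are hypothetical.

```python
# Points per evidence strength in the Tavtigian et al. (2020) scale.
POINTS = {"supporting": 1, "moderate": 2, "strong": 4, "very_strong": 8}

# Default strength by code prefix (Richards et al., 2015). BA1 is not handled.
DEFAULT_STRENGTH = {
    "PVS": "very_strong",
    "PS": "strong",
    "PM": "moderate",
    "PP": "supporting",
    "BS": "strong",
    "BP": "supporting",
}


def total_points(codes):
    """Sum points for evidence codes such as 'PS3' or 'BP4' at default strength."""
    total = 0
    for code in codes:
        prefix = code.rstrip("0123456789")
        points = POINTS[DEFAULT_STRENGTH[prefix]]
        # Pathogenic codes start with P, benign codes with B.
        total += points if prefix.startswith("P") else -points
    return total


def classify(points):
    """Map a point total to a five-tier class."""
    if points >= 10:
        return "Pathogenic"
    if points >= 6:
        return "Likely pathogenic"
    if points >= 0:
        return "Uncertain significance"
    if points >= -6:
        return "Likely benign"
    return "Benign"


if __name__ == "__main__":
    codes = ["PS3", "PM1", "PM2", "PP3"]  # the codes named in the text
    print(total_points(codes), classify(total_points(codes)))
```

At default strengths, the four codes named in the text sum to 9 points, which falls in the likely pathogenic range. A real assessment may upgrade or downgrade individual codes.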
Source: Fisher (1922). Benjamini & Hochberg (1995). Panel adjustment methodology follows AACR GENIE recommendations.

Reproducibility

All analysis scripts are available in the project repository. The 13-step pipeline spans 25+ computational tools and 31 AI research scripts. The master pipeline (mutation_profile/scripts/run_all.py) orchestrates execution with automated verification against known reference values at each stage.
Raw GENIE data requires a signed Synapse Data Use Agreement (syn53210) (AACR Project GENIE Consortium, 2017 [17]). Cross-database queries use public APIs (GDC, cBioPortal, ClinVar via NCBI E-utilities, Open Targets GraphQL, OncoKB, CIViC, ClinGen, DGIdb, PharmGKB, gnomAD, STRING, SynLethDB, MaveDB). ICGC/PCAWG requires DACO access. OncoKB requires an API token.
Software requirements: Python 3.12, pandas, scipy, plotly, torch, fair-esm, openbabel, meeko, vina, requests, pyclone-vi, networkx. GPU workloads (ESM-2, ESMFold) require CUDA-capable GPU with ≥8 GB VRAM. Boltz-1 requires ≥40 GB VRAM (A100).
Every number in this portal is traceable to a specific script and data file. The verification script (mutation_profile/scripts/verify_results.py) checks all key metrics against hardcoded reference values and fails loudly on any regression.
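
A regression check of this kind might look like the sketch below. It is not the contents of verify_results.py: the metric keys and the function name are hypothetical, and only the two reference values (190 pairs tested, 138 significant) are taken from this page.

```python
# Reference values quoted on this page; the key names are hypothetical.
REFERENCE = {
    "n_pairs_tested": 190,
    "n_pairs_significant": 138,
}


def verify_metrics(metrics, reference=REFERENCE):
    """Raise AssertionError listing every metric that is missing or differs."""
    problems = []
    for key, expected in reference.items():
        if key not in metrics:
            problems.append(f"{key}: missing (expected {expected})")
        elif metrics[key] != expected:
            problems.append(f"{key}: got {metrics[key]}, expected {expected}")
    if problems:
        raise AssertionError("Regression detected:\n  " + "\n  ".join(problems))
    return True


if __name__ == "__main__":
    verify_metrics({"n_pairs_tested": 190, "n_pairs_significant": 138})
    print("all reference values match")
```

Collecting all mismatches before raising means a single run reports every regression rather than only the first one.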
Total pipeline runtime: ~5 minutes on a local workstation (excluding Boltz-1 GPU runs on Vast.ai, which add ~15 minutes per protein-ligand pair on an A100).
Source: full source code in mutation_profile/scripts/ (44 core + 31 AI research scripts). Verification: mutation_profile/scripts/verify_results.py.

Data Downloads

All data files, methodology documentation, and citation references are available for download.

Citation Export
@misc{roine2026_mutation_profile,
  author       = {Røine, Henrik G.},
  title        = {Quintuple Somatic Driver Mutation Profile in MDS-AML:
                  Co-occurrence Analysis Across 31,000+ Myeloid Patients},
  year         = {2026},
  note         = {N=1 case study. AACR Project GENIE v19.0,
                  cBioPortal, GDC, ICGC, ClinVar. Zero matches
                  across 10+ databases. Pairwise-corrected
                  expected frequency: 1 in 1.9 billion.},
  howpublished = {Genomics Portal},
}
References
  1. AACR Project GENIE Consortium. AACR Project GENIE: Powering Precision Medicine through an International Consortium. Cancer Discov (2017).
  2. Papaemmanuil E, Gerstung M, Bullinger L, et al. Genomic Classification and Prognosis in Acute Myeloid Leukemia. N Engl J Med (2016).
  3. Kandoth C, McLellan MD, Vandin F, et al. Mutational landscape and significance across 12 major cancer types. Nature (2013).
  4. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc Series B (1995).
  5. Haferlach T, Nagata Y, Grossmann V, et al. Landscape of genetic lesions in 944 patients with myelodysplastic syndromes. Leukemia (2014).
  6. Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science (2023).
  7. Cheng J, Novati G, Pan J, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science (2023).
  8. Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem (2010).
  9. Corso G, Staerk H, Jing B, Barzilay R, Jaakkola T. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. ICLR (2023).
  10. Wohlwend J, Corso G, Passaro S, et al. Boltz-1: Democratizing Biomolecular Interaction Modeling. bioRxiv (2024).
  11. Chakravarty D, Gao J, Phillips SM, et al. OncoKB: A Precision Oncology Knowledge Base. JCO Precis Oncol (2017).
  12. Griffith M, Spies NC, Krysiak K, et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat Genet (2017).
  13. Rehm HL, Berg JS, Brooks LD, et al. ClinGen: The Clinical Genome Resource. N Engl J Med (2015).
  14. Freshour SL, Kiwala S, Cotto KC, et al. Integration of the Drug-Gene Interaction Database (DGIdb 4.0) with open crowdsource efforts. Nucleic Acids Res (2021).
  15. Whirl-Carrillo M, Huddart R, Gong L, et al. An Evidence-Based Framework for Evaluating Pharmacogenomics Knowledge for Personalized Medicine. Clin Pharmacol Ther (2021).
  16. Chen S, Francioli LC, Goodrich JK, et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature (2024).
  17. Frazer J, Notin P, Dias M, et al. Disease variant prediction with deep generative models of evolutionary data. Nature (2021).
  18. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell (2019).
  19. Esposito D, Weile J, Shendure J, et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol (2019).
  20. Szklarczyk D, Kirsch R, Koutrouli M, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res (2023).
  21. Wang J, Wu M, Huang X, et al. SynLethDB 2.0: A web-based knowledge graph database on synthetic lethality for novel anticancer drug discovery. Database (2022).
  22. Tsherniak A, Vazquez F, Montgomery PG, et al. Defining a Cancer Dependency Map. Cell (2017).
  23. Gillis S, Roth A. PyClone-VI: scalable inference of clonal population structures using whole genome data. BMC Bioinformatics (2020).
  24. Richards S, Aziz N, Bale S, et al. Standards and guidelines for the interpretation of sequence variants. Genet Med (2015).
  25. Tavtigian SV, Harrison SM, Boucher KM, Biesecker LG. Fitting a naturally scaled point system to the ACMG/AMP variant classification guidelines. Hum Mutat (2020).
  26. Canisius S, Martens JW, Wessels LF. A novel independence test for somatic alterations in cancer shows that biology drives mutual exclusivity but chance explains most co-occurrence. Genome Biol (2016).
  27. Bernard E, Tuechler H, Greenberg PL, et al. Molecular International Prognostic Scoring System for Myelodysplastic Syndromes. NEJM Evid (2022).
  28. Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res (2019).
  29. Ioannidis NM, Rothstein JH, Pejaver V, et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet (2016).