Methods & Reproducibility
Pipeline architecture, data provenance, statistical methods, computational tools
Pipeline Architecture
The analysis pipeline proceeds in thirteen sequential stages, each
producing traceable intermediate outputs:
1. Data acquisition: GENIE v19.0 [17] downloaded from Synapse under a
signed Data Use Agreement (syn53210). Raw mutation, clinical, and panel
data are ingested via genie_loader.py.
2. Filtering: Restricted to myeloid OncoTree
codes (AML, MDS, MPN, CMML, MDS/MPN, JMML) [12]. Coding variants only:
Intron, Silent, UTR, Flank, IGR, RNA, and Splice_Region excluded.
Hypermutation filter removes samples with >20 coding mutations in the
34 target genes.
3. Panel adjustment: For each gene pair, only
patients whose sequencing panel covers both genes are included
in the denominator. This corrects for heterogeneous panel designs across
40+ contributing centers [17].
4. Co-occurrence testing: Fisher's exact test
(two-sided) applied to all 190 pairwise combinations of the top 20
mutated myeloid genes, following the approach of Kandoth et al. [31].
Produces observed/expected ratios and raw p-values.
5. Multiple testing correction:
Benjamini-Hochberg FDR correction [24] at α=0.05 across all 190
tests. 138 pairs remain significant after correction.
6. Cross-database validation: Quadruple
co-occurrence searched across 10+ independent databases totaling ~31,000
deduplicated myeloid patients [48]. Zero matches found.
7. AI/ML scoring: ESM-2 [18] masked marginal log-likelihood ratios
for variant pathogenicity. AlphaMissense [19] for proteome-wide missense
effect prediction. ESMFold for structural prediction. AutoDock Vina [32]
and DiffDock [33] for molecular docking. Boltz-1 [34] for protein-ligand
co-folding on an A100 GPU.
8. Clinical annotation: OncoKB [35] (v4 API) for clinical
actionability levels. CIViC [36] (GraphQL API) for evidence-based variant
annotation. ClinGen [37] for gene-disease validity classification.
PharmGKB [39] for pharmacogenomic annotations and FDA drug labels.
9. Population frequency: gnomAD v4.1 [40]
GraphQL queries for population allele
frequencies across all five patient variants, confirming ultra-rare
status.
10. Extended pathogenicity: EVE [41] (evolutionary model of variant
effect) for unsupervised pathogenicity scores. SpliceAI [42] for cryptic
splice site prediction. MaveDB [43] deep mutational scanning data for
DNMT3A functional evidence.
11. Network biology: STRING v12.0 [44] protein-protein interaction
network for pathway connectivity. SynLethDB 2.0 [45] for synthetic
lethality predictions. DepMap [46] / GDSC portal for drug
sensitivity in cell lines.
12. Clonal architecture: PyClone-VI [16]
for clonal tree reconstruction from
variant allele frequencies, modeling subclonal structure across the five
mutations.
13. ACMG evidence aggregation: Bayesian
point-based classification following the standards of Richards et al.
[25] and the revised quantitative framework of Tavtigian et al. [26],
integrating evidence from 14
sources (ESM-2, CADD, REVEL, AlphaMissense, EVE, SpliceAI, ClinVar,
ClinGen, OncoKB, CIViC, gnomAD, MaveDB, conservation, functional
studies) into final pathogenicity calls.
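The filtering rules in stage 2 can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's genie_loader.py: the record layout (`sample_id`, `gene`, `variant_class`), the function name, and the exact classification strings are assumptions made for the example.

```python
from collections import Counter

# Variant classes excluded in stage 2, spelled as in the text above; the
# exact strings used in the real mutation files are an assumption.
EXCLUDED_CLASSES = {"Intron", "Silent", "UTR", "Flank", "IGR", "RNA", "Splice_Region"}
MYELOID_CODES = {"AML", "MDS", "MPN", "CMML", "MDS/MPN", "JMML"}
MAX_CODING_MUTATIONS = 20  # samples with more than this are treated as hypermutated


def filter_mutations(mutations, oncotree_by_sample, target_genes):
    """Keep coding variants from myeloid samples and drop hypermutated samples.

    mutations: list of dicts with keys sample_id, gene, variant_class (hypothetical layout).
    oncotree_by_sample: dict mapping sample_id to its OncoTree code.
    target_genes: set of gene symbols counted by the hypermutation filter.
    """
    coding = [
        m for m in mutations
        if oncotree_by_sample.get(m["sample_id"]) in MYELOID_CODES
        and m["variant_class"] not in EXCLUDED_CLASSES
    ]
    # Count coding mutations per sample within the target genes only.
    per_sample = Counter(m["sample_id"] for m in coding if m["gene"] in target_genes)
    return [m for m in coding if per_sample[m["sample_id"]] <= MAX_CODING_MUTATIONS]
```

Applying the hypermutation count after the coding-only filter means excluded variant classes never count toward the >20 threshold, which matches the wording "coding mutations" in stage 2.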
Data Sources
| Source | Type | Patients / scope | Access |
|---|---|---|---|
| GENIE v19.0 [17] | Sequencing panel | 27,585 myeloid | Synapse (DUA) |
| GDC/TCGA | WES/WGS | 16,411 | Open access |
| cBioPortal | Aggregated | 25,873 | Open access |
| IPSS-M [13] | Targeted panel | 2,957 | Published (Bernard 2022) |
| ICGC/PCAWG | WGS | 1,575 | DACO |
| ClinVar | Curated variants | N/A | Open access |
| Open Targets | Drug-gene | N/A | Open access |
| OncoKB [35] | Clinical annotation | 5 variants | API (requires token) |
| gnomAD v4 [40] | Population frequency | 5 variants | Open access |
| MaveDB [43] | Functional data | DNMT3A DMS | Open access |
Source: AACR Project GENIE Consortium
(2024). Synapse syn53210. Bernard et al., NEJM Evid 2022. GDC Data
Portal v2. cBioPortal for Cancer Genomics. ICGC/PCAWG Consortium.
Computational Tools
| Tool | Version | Purpose | Hardware |
|---|---|---|---|
| ESM-2 [18] | esm2_t33_650M_UR50D | Variant pathogenicity scoring | RTX 4060 (8GB) |
| ESMFold [18] | Via ESM-2 | Structure prediction + contact maps | RTX 4060 |
| AutoDock Vina [32] | 1.2.x | Molecular docking | CPU |
| DiffDock [33] | NVIDIA NIM API | ML-based docking | Cloud (NVIDIA) |
| Boltz-1 [34] | Latest | Protein-ligand co-folding | Vast.ai A100 (40GB) |
| Gemini | 2.5 Pro / 3.1 Pro | Literature synthesis + clinical interpretation | Google API |
| TxGemma | Flash | Drug-target predictions | Google API |
| OncoKB [35] | v4 API | Clinical actionability annotation | API |
| CIViC [36] | GraphQL API | Evidence-based variant annotation | API |
| ClinGen [37] | API | Gene-disease validity classification | API |
| DGIdb [38] | v5 API | Drug-gene interactions (76 found) | API |
| PharmGKB [39] | API | Pharmacogenomic annotations + FDA labels | API |
| gnomAD [40] | v4.1 GraphQL | Population allele frequencies | API |
| EVE [41] | Precomputed | Evolutionary variant pathogenicity | Downloaded scores |
| SpliceAI [42] | Precomputed | Cryptic splice site prediction | Downloaded scores |
| AlphaMissense [19] | Precomputed | Missense pathogenicity prediction | Downloaded scores |
| CADD [27] | v1.7 (GRCh38) | Combined annotation-dependent deleteriousness | Precomputed scores |
| REVEL [28] | Precomputed | Rare missense variant pathogenicity ensemble | Downloaded scores |
| PyClone-VI [16] | 0.3+ | Clonal tree reconstruction | CPU |
| DepMap/GDSC [46] | Portal | Drug sensitivity in cell lines | Web API |
| SynLethDB [45] | 2.0 | Synthetic lethality predictions | API |
| MaveDB [43] | API | Deep mutational scanning data (DNMT3A) | API |
| STRING [44] | v12.0 | Protein-protein interactions | API |
| Python | 3.12 | Analysis pipeline | Local |
Source:
Lin et al. (2023) ESM-2.
Corso et al. (2023) DiffDock.
Wohlwend et al. (2024) Boltz-1.
Trott & Olson (2010) AutoDock Vina.
Google DeepMind (2024) Gemini, TxGemma.
Rentzsch et al. (2019) CADD.
Ioannidis et al. (2016) REVEL.
Cheng et al. (2023) AlphaMissense.
Chakravarty et al. (2017) OncoKB.
Griffith et al. (2017) CIViC.
Rehm et al. (2015) ClinGen.
Freshour et al. (2021) DGIdb.
Whirl-Carrillo et al. (2021) PharmGKB.
Chen et al. (2024) gnomAD v4.
Frazer et al. (2021) EVE.
Jaganathan et al. (2019) SpliceAI.
Gillis et al. (2021) PyClone-VI.
Szklarczyk et al. (2023) STRING v12.
Tsherniak et al. (2017) DepMap.
Wang et al. (2022) SynLethDB 2.0.
Esposito et al. (2019) MaveDB.
Statistical Methods
Fisher's exact test (two-sided) is used for all
co-occurrence testing [31], the appropriate test for
2×2 contingency tables with small expected cell counts, avoiding
the asymptotic assumptions of chi-squared tests.
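To make the test concrete, the two-sided p-value for one gene pair can be computed directly from the hypergeometric distribution. The sketch below uses only the standard library so it stays self-contained; the function name and table layout are assumptions for illustration, and a pipeline that depends on scipy would more likely call `scipy.stats.fisher_exact` instead.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    For a gene pair: a = both mutated, b = only gene A, c = only gene B,
    d = neither (all counts among panel-eligible patients).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x co-mutated patients given fixed margins.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is as likely or less likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)
```

Because the p-value is an exact sum over all tables with the observed margins, no minimum expected cell count is required.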
Benjamini-Hochberg FDR correction [24] is applied at α=0.05 across
all 190 gene-pair tests to control the false discovery rate. Of 190
pairs tested, 138 remain significant after correction.
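The step-up procedure is short enough to write out. The following is a minimal standard-library sketch of Benjamini-Hochberg adjusted p-values; the function name and return shape are chosen for the example and are not taken from the project scripts.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (adjusted p-values, reject flags), both in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over all ranks at or above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted, [q <= alpha for q in adjusted]


adjusted, reject = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
print(reject)  # [True, True, False, False, False]
```

In the toy input, 0.039 and 0.041 are below 0.05 before correction but not after it, which is the behaviour the correction is meant to provide across the 190 gene-pair tests.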
Observed/Expected ratio is calculated with
panel-adjusted denominators. For each gene pair, the expected
co-occurrence frequency is the product of individual mutation
frequencies among panel-eligible patients. Log2(O/E) is used for
heatmap visualization to symmetrize enrichment and depletion.
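As an illustration of the panel-adjusted calculation, the sketch below restricts the denominator to patients whose panel covers both genes before computing frequencies. The input layout (a pair of sets per patient) and the function name are assumptions for the example. How the pipeline handles pairs with zero observed co-occurrence in the log2 heatmap is not stated, so the sketch simply returns None for the log value in that case.

```python
from math import log2


def observed_expected(patients, gene_a, gene_b):
    """Panel-adjusted O/E ratio for one gene pair.

    patients: iterable of (covered_genes, mutated_genes) pairs of sets
              (hypothetical layout, one entry per patient).
    """
    # Only patients sequenced for both genes enter the denominator.
    eligible = [mut for cov, mut in patients if gene_a in cov and gene_b in cov]
    n = len(eligible)
    if n == 0:
        return None
    freq_a = sum(gene_a in mut for mut in eligible) / n
    freq_b = sum(gene_b in mut for mut in eligible) / n
    observed = sum(gene_a in mut and gene_b in mut for mut in eligible) / n
    expected = freq_a * freq_b  # independence assumption
    if expected == 0:
        return None
    ratio = observed / expected
    return {
        "n_eligible": n,
        "observed": observed,
        "expected": expected,
        "oe_ratio": ratio,
        "log2_oe": log2(ratio) if ratio > 0 else None,
    }
```

With log2 scaling, a twofold enrichment (+1) and a twofold depletion (-1) sit at equal distances from zero, which is why the transformed value suits a diverging heatmap.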
Mutual exclusivity testing uses a maximum entropy model [30]
to distinguish biological mutual
exclusivity from chance co-occurrence patterns, accounting for the
marginal mutation frequencies of each gene.
Quadruple expected frequency is calculated under a
statistical independence model: the product of the four individual gene
mutation frequencies yields an expected frequency of ~0.000113 per
patient, or ~0.004 expected matches in 31,000 patients. This is
consistent with the observed zero but does not account for potential
biological dependencies between the mutations.
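The independence calculation itself is a product followed by a scaling step. The sketch below shows the arithmetic and adds a Poisson approximation for the probability of seeing zero carriers, which is an addition for illustration and not described in the text. The four input frequencies are placeholders, not the study's values.

```python
from math import exp, prod


def expected_joint_count(frequencies, n_patients):
    """Expected number of patients carrying all mutations under independence.

    Returns (joint frequency, expected count, Poisson probability of zero carriers).
    """
    joint = prod(frequencies)
    expected = joint * n_patients
    p_zero = exp(-expected)  # Poisson approximation, an assumption of this sketch
    return joint, expected, p_zero


# Placeholder per-gene mutation frequencies, chosen only to show the arithmetic.
joint, expected, p_zero = expected_joint_count([0.2, 0.1, 0.05, 0.01], 31_000)
print(f"joint={joint:.2e} expected={expected:.2f} P(zero)={p_zero:.2f}")
```

An expected count well below one makes an observed zero unsurprising under independence, so the zero by itself cannot distinguish chance from biological incompatibility.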
ACMG variant classification follows the standards of
Richards et al. (2015) [25] with the quantitative Bayesian
framework of Tavtigian et al. (2020) [26]. Evidence codes (PP3, PM1, PM2,
PS3, etc.) are aggregated from 14 computational and clinical sources
into a final pathogenicity call.
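A point-based aggregation in the style of Tavtigian et al. (2020) can be sketched as follows. The point values (supporting 1, moderate 2, strong 4, very strong 8, negative for benign evidence) and the category thresholds follow that framework; inferring each code's strength from its prefix, and ignoring strength modifiers and the stand-alone BA1 code, are simplifications made for the example. How the pipeline maps its 14 sources onto codes is not shown here.

```python
# Default strength per evidence-code prefix; benign evidence scores negative.
POINTS = {"PVS": 8, "PS": 4, "PM": 2, "PP": 1, "BS": -4, "BP": -1}


def code_points(code):
    """Points for one evidence code such as 'PM2', at its default strength."""
    # Try longer prefixes first so 'PVS1' is not mistaken for another class.
    for prefix in sorted(POINTS, key=len, reverse=True):
        if code.startswith(prefix):
            return POINTS[prefix]
    raise ValueError(f"Unrecognized evidence code: {code}")


def classify(codes):
    """Sum the points and map the total to a five-tier classification."""
    total = sum(code_points(c) for c in codes)
    if total >= 10:
        label = "Pathogenic"
    elif total >= 6:
        label = "Likely pathogenic"
    elif total >= 0:
        label = "Uncertain significance"
    elif total >= -6:
        label = "Likely benign"
    else:
        label = "Benign"
    return total, label


print(classify(["PS3", "PM1", "PM2", "PP3"]))  # (9, 'Likely pathogenic')
```

Summing points makes the contribution of each source explicit, so a call can be traced back to the individual codes that produced it.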
Source: Fisher (1922). Benjamini &
Hochberg (1995). Panel adjustment methodology follows AACR GENIE
recommendations.
Reproducibility
All analysis scripts are available in the project repository. The
13-step pipeline spans 25+ computational tools and 31 AI research
scripts. The master pipeline (mutation_profile/scripts/run_all.py)
orchestrates execution with automated verification against known
reference values at each stage.
Raw GENIE data requires a signed Synapse Data Use Agreement [17]
(syn53210). Cross-database queries use
public APIs (GDC, cBioPortal, ClinVar via NCBI E-utilities, Open
Targets GraphQL, OncoKB, CIViC, ClinGen, DGIdb, PharmGKB, gnomAD,
STRING, SynLethDB, MaveDB). ICGC/PCAWG requires DACO access. OncoKB
requires an API token.
Software requirements: Python 3.12, pandas, scipy,
plotly, torch, fair-esm, openbabel, meeko, vina, requests, pyclone-vi,
networkx. GPU workloads (ESM-2, ESMFold) require a CUDA-capable GPU with
≥8 GB VRAM. Boltz-1 requires ≥40 GB VRAM (A100).
Every number in this portal is traceable to a specific script and data
file. The verification script
(mutation_profile/scripts/verify_results.py) checks all
key metrics against hardcoded reference values and fails loudly on any
regression.
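A check of this kind can be as simple as comparing a dictionary of computed metrics with a dictionary of reference values. The sketch below is not the contents of verify_results.py: the metric names and function name are hypothetical, and only the three reference values are taken from the figures reported above.

```python
# Reference values quoted in this document; the key names are hypothetical.
REFERENCE_VALUES = {
    "gene_pairs_tested": 190,
    "pairs_significant_after_fdr": 138,
    "quadruple_matches": 0,
}


def verify_metrics(computed, reference=REFERENCE_VALUES):
    """Raise AssertionError listing every metric that differs from its reference."""
    mismatches = []
    for name, expected in reference.items():
        actual = computed.get(name)
        if actual != expected:
            mismatches.append(f"{name}: expected {expected!r}, got {actual!r}")
    if mismatches:
        # Fail loudly and report all regressions at once, not just the first.
        raise AssertionError("Verification failed:\n" + "\n".join(mismatches))
```

Collecting all mismatches before raising means a single run reports every regression, which is convenient when one upstream change shifts several downstream metrics.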
Total pipeline runtime: ~5 minutes on a local
workstation (excluding Boltz-1 GPU runs on Vast.ai, which add ~15
minutes per protein-ligand pair on an A100).
Source: Full source code:
mutation_profile/scripts/ (44 core + 31 AI research). Verification:
mutation_profile/scripts/verify_results.py.
Data Downloads
All data files, methodology documentation, and citation references are available for download.
Citation Export
@misc{roine2026_mutation_profile,
author = {Røine, Henrik G.},
title = {Quintuple Somatic Driver Mutation Profile in MDS-AML:
Co-occurrence Analysis Across 31,000+ Myeloid Patients},
year = {2026},
note = {N=1 case study. AACR Project GENIE v19.0,
cBioPortal, GDC, ICGC, ClinVar. Zero matches
across 10+ databases. Pairwise-corrected
expected frequency: 1 in 1.9 billion.},
howpublished = {Genomics Portal},
}
References
- [12] Papaemmanuil E, Gerstung M, Bullinger L, et al. Genomic Classification and Prognosis in Acute Myeloid Leukemia. N Engl J Med (2016).
- [13] Bernard E, Tuechler H, Greenberg PL, et al. Molecular International Prognostic Scoring System for Myelodysplastic Syndromes. NEJM Evid (2022).
- [16] Gillis S, Roth A. PyClone-VI: scalable inference of clonal population structures using whole genome data. BMC Bioinformatics (2020).
- [17] AACR Project GENIE Consortium. AACR Project GENIE: Powering Precision Medicine through an International Consortium. Cancer Discov (2017).
- [18] Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science (2023).
- [19] Cheng J, Novati G, Pan J, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science (2023).
- [24] Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc Series B (1995).
- [25] Richards S, Aziz N, Bale S, et al. Standards and guidelines for the interpretation of sequence variants. Genet Med (2015).
- [26] Tavtigian SV, Harrison SM, Boucher KM, Biesecker LG. Fitting a naturally scaled point system to the ACMG/AMP variant classification guidelines. Hum Mutat (2020).
- [27] Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res (2019).
- [28] Ioannidis NM, Rothstein JH, Pejaver V, et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet (2016).
- [30] Canisius S, Martens JW, Wessels LF. A novel independence test for somatic alterations in cancer shows that biology drives mutual exclusivity but chance explains most co-occurrence. Genome Biol (2016).
- [31] Kandoth C, McLellan MD, Vandin F, et al. Mutational landscape and significance across 12 major cancer types. Nature (2013).
- [32] Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem (2010).
- [33] Corso G, Staerk H, Jing B, Barzilay R, Jaakkola T. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. ICLR (2023).
- [34] Wohlwend J, Corso G, Passaro S, et al. Boltz-1: Democratizing Biomolecular Interaction Modeling. bioRxiv (2024).
- [35] Chakravarty D, Gao J, Phillips SM, et al. OncoKB: A Precision Oncology Knowledge Base. JCO Precis Oncol (2017).
- [36] Griffith M, Spies NC, Krysiak K, et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nat Genet (2017).
- [37] Rehm HL, Berg JS, Brooks LD, et al. ClinGen: The Clinical Genome Resource. N Engl J Med (2015).
- [38] Freshour SL, Kiwala S, Cotto KC, et al. Integration of the Drug-Gene Interaction Database (DGIdb 4.0) with open crowdsource efforts. Nucleic Acids Res (2021).
- [39] Whirl-Carrillo M, Huddart R, Gong L, et al. An Evidence-Based Framework for Evaluating Pharmacogenomics Knowledge for Personalized Medicine. Clin Pharmacol Ther (2021).
- [40] Chen S, Francioli LC, Goodrich JK, et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature (2024).
- [41] Frazer J, Notin P, Dias M, et al. Disease variant prediction with deep generative models of evolutionary data. Nature (2021).
- [42] Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell (2019).
- [43] Esposito D, Weile J, Shendure J, et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol (2019).
- [44] Szklarczyk D, Kirsch R, Koutrouli M, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res (2023).
- [45] Wang J, Wu M, Huang X, et al. SynLethDB 2.0: A web-based knowledge graph database on synthetic lethality for novel anticancer drug discovery. Database (2022).
- [46] Tsherniak A, Vazquez F, Montgomery PG, et al. Defining a Cancer Dependency Map. Cell (2017).
- [48] Haferlach T, Nagata Y, Grossmann V, et al. Landscape of genetic lesions in 944 patients with myelodysplastic syndromes. Leukemia (2014).