Statistical Rarity
Bayesian rare-event estimation, 1 in 1.9 billion variant-level frequency
Jeffreys median: 1.22e-5 (ultra-rare; posterior median under the Jeffreys Beta(0.5, 0.5) prior)
Patients for 50%: 55,875 (future cohort needed for >50% chance of 1 match)
Variant specificity: 7.1 orders (specific variants are 12.6 million times rarer than gene-level)
Combined search: 0 matches (26,642 patients, GENIE + HARMONY)
Jeffreys Posterior Estimation
With 0 observed quintuples among 18,625 myeloid patients, the non-informative
Jeffreys prior Beta(0.5, 0.5) yields the posterior Beta(0.5, 18,625.5).
The posterior median (1.22e-5) serves as the point estimate of the true
population frequency. The two-sided 95% credible interval spans 2.64e-8 to 1.35e-4,
bounding the plausible frequency range.
| Statistic | Value | Interpretation |
|---|---|---|
| Posterior median | 1.22e-5 | Best point estimate of true frequency |
| Posterior mean | 2.68e-5 | Mean of Beta(0.5, 18,625.5) |
| 95% credible interval, lower | 2.64e-8 | 2.5th percentile of the posterior |
| 95% credible interval, upper | 1.35e-4 | 97.5th percentile of the posterior |
| One-sided 95% upper | 1.03e-4 | 95th percentile of the posterior |
Source: bayesian_rarity_results.json.
Jeffreys prior: non-informative conjugate prior for binomial data.
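The posterior summaries above can be approximately reproduced with the Python standard library. The sketch below is an illustration only, not the pipeline that produced bayesian_rarity_results.json. It relies on the identity that if T follows a Student t distribution with nu degrees of freedom, then T^2 / (nu + T^2) follows Beta(1/2, nu/2). It then replaces the t quantile with a normal quantile, an assumption that is reasonable here only because nu = 37,251 is large. The function name `jeffreys_zero_event_quantile` is hypothetical.

```python
from statistics import NormalDist


def jeffreys_zero_event_quantile(p, n):
    """Approximate p-quantile of Beta(0.5, n + 0.5).

    Beta(0.5, n + 0.5) is the Jeffreys posterior after 0 events in n trials.
    Uses T^2 / (nu + T^2) ~ Beta(1/2, nu/2) for T ~ t(nu) with nu = 2n + 1,
    and approximates the t quantile by a standard normal quantile
    (assumption: nu is large enough for this to be accurate).
    """
    nu = 2 * n + 1
    z = NormalDist().inv_cdf((1 + p) / 2)
    return z * z / (nu + z * z)


n = 18_625
posterior_mean = 0.5 / (n + 1)  # exact mean a / (a + b) of Beta(0.5, n + 0.5)
posterior_median = jeffreys_zero_event_quantile(0.5, n)
ci_lower = jeffreys_zero_event_quantile(0.025, n)
ci_upper = jeffreys_zero_event_quantile(0.975, n)
one_sided_upper = jeffreys_zero_event_quantile(0.95, n)

print(f"median={posterior_median:.3g} mean={posterior_mean:.3g}")
print(f"95% CrI=({ci_lower:.3g}, {ci_upper:.3g}) one-sided={one_sided_upper:.3g}")
```

An exact Beta quantile function (for example from a statistics library) would remove the normal approximation; the values should agree with the table to the precision shown.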
Rule of Three (Frequentist)
The Rule of Three (Hanley and Lippman-Hand, 1983) provides a quick frequentist upper
bound: with 0 events in n trials, the one-sided 95% upper confidence limit is approximately 3/n.
For 18,625 patients, this gives an upper bound of 1.61e-4.
The Clopper-Pearson exact one-sided 95% upper bound (1.61e-4) and the
Poisson exact 99% upper bound (2.47e-4) are alternative
frequentist bounds, of the same order as the upper limit of the Bayesian credible interval (1.35e-4).
| Method | Upper Bound | Confidence Level |
|---|---|---|
| Rule of Three | 1.61e-4 | 95% |
| Clopper-Pearson exact | 1.61e-4 | 95% |
| Poisson exact | 2.47e-4 | 99% |
Source: bayesian_rarity_results.json.
Hanley JA, Lippman-Hand A. If nothing goes wrong, is everything all right? JAMA (1983).
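For zero observed events, all three frequentist bounds have closed forms, so they can be checked in a few lines. This is a minimal sketch; the function names are hypothetical, and the Clopper-Pearson bound is taken as one-sided (solving (1 - p)^n = alpha), which is the reading that reproduces the 1.61e-4 in the table.

```python
import math


def rule_of_three(n):
    """Approximate one-sided 95% upper bound for 0 events in n trials."""
    return 3 / n


def clopper_pearson_upper_zero(n, alpha=0.05):
    """Exact one-sided binomial upper bound for 0 events: solves (1 - p)^n = alpha."""
    return -math.expm1(math.log(alpha) / n)


def poisson_upper_zero(n, alpha=0.01):
    """Exact one-sided Poisson upper bound on the rate for 0 events: exp(-n * rate) = alpha."""
    return -math.log(alpha) / n


n = 18_625
bound_rule3 = rule_of_three(n)
bound_cp95 = clopper_pearson_upper_zero(n, alpha=0.05)
bound_pois99 = poisson_upper_zero(n, alpha=0.01)

print(f"rule of three={bound_rule3:.3g}")
print(f"Clopper-Pearson 95%={bound_cp95:.3g}")
print(f"Poisson 99%={bound_pois99:.3g}")
```

The Rule of Three is simply the Clopper-Pearson bound with -ln(0.05), about 2.996, rounded up to 3.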
Gene-Level vs Variant-Level Rarity
The five genes (DNMT3A, IDH2, SETBP1, PTPN11, EZH2) are individually common in
myeloid malignancies. At the gene level, the expected co-occurrence count is 1.94e-2,
placing this combination at the 77.8th percentile among all 278,256 possible five-gene
combinations from 34 myeloid genes. The rarity is driven entirely by variant specificity:
the specific amino acid changes are 7.1 orders of magnitude (about 1.3e7-fold) rarer than
the gene-level combination. EZH2 V662A (0 carriers in GENIE) is the primary driver of
variant-level rarity.
| Gene | Variant-to-Gene Ratio | Interpretation |
|---|---|---|
| DNMT3A | 0.2752 | Moderate specificity |
| IDH2 | 0.8084 | Moderate specificity |
| SETBP1 | 0.3170 | Moderate specificity |
| PTPN11 | 0.0210 | High specificity |
| EZH2 | 5.37e-5 | Extreme specificity (0 carriers) |
Gene-level expected: 1.94e-2 (77.8th percentile among 278,256 combinations)
Variant-level expected: 1.54e-9 (7.1 orders rarer than gene-level)
Quintuple probability: ~7.7e-13 (1 in 1.3 trillion)
Source: bayesian_rarity_results.json.
Extreme Value Theory: Coles S. Statistical Modeling of Extreme Values. Springer (2001).
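The link between the per-gene ratios, the 7.1-order gap, and the variant-level expected count is plain arithmetic, sketched below from the rounded values in the table. Because the inputs are rounded, the product comes out near 7.95e-8 rather than the 7.93e-8 reported elsewhere on this page; the difference is assumed to be rounding. Variable names are illustrative.

```python
import math

# Variant-to-gene ratios as printed in the table (rounded values).
variant_to_gene_ratio = {
    "DNMT3A": 0.2752,
    "IDH2": 0.8084,
    "SETBP1": 0.3170,
    "PTPN11": 0.0210,
    "EZH2": 5.37e-5,
}
gene_level_expected = 1.94e-2  # expected gene-level quintuples, from the text

# Product of the five ratios: how much rarer the exact variants are.
specificity_product = math.prod(variant_to_gene_ratio.values())
variant_level_expected = gene_level_expected * specificity_product
rarity_gap_orders = -math.log10(specificity_product)
fold_rarer = 1 / specificity_product

print(f"specificity product={specificity_product:.3g}")
print(f"variant-level expected={variant_level_expected:.3g}")
print(f"gap={rarity_gap_orders:.1f} orders ({fold_rarer / 1e6:.1f} million-fold)")
```

The EZH2 ratio contributes more than four of the roughly seven orders of magnitude, which is why that variant dominates the rarity.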
Combined Database Search
No quintuple matches were found across two independent, non-overlapping databases totaling
26,642 myeloid patients. GENIE v19.0 contributes 18,625 patients
with panel-based sequencing from 40+ academic medical centers worldwide. The HARMONY
Alliance contributes 12,041 AML patients from European institutions, sequenced with
a 41-gene panel (Eurofins-Biomnis). Both panels cover all five target genes.
| Database | Patients | Panel | All 5 Genes Covered | Quintuple Matches |
|---|---|---|---|---|
| GENIE v19.0 | 18,625 | Variable (40+ panels) | Yes | 0 |
| HARMONY Alliance | 12,041 | 41-gene panel | Yes | 0 |
| Combined | 26,642 | | | 0 |
Gene-level frequencies in HARMONY (AML-only): DNMT3A 25.2%, IDH2 12.7%, PTPN11 7.5%, EZH2 6%, SETBP1 2%.
Under independence, the expected number of gene-level quintuples in HARMONY alone is 0.035
(the product of the five frequencies multiplied by 12,041 patients).
Source: harmony_results.json.
AACR Project GENIE Consortium (2017). HARMONY Alliance (harmony-alliance.eu).
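The HARMONY expectation under independence is the cohort size times the product of the five gene-level frequencies. The short sketch below uses only the figures quoted above; the variable names are illustrative.

```python
import math

# Gene-level mutation frequencies in HARMONY (AML-only), as quoted in the text.
harmony_gene_freq = {
    "DNMT3A": 0.252,
    "IDH2": 0.127,
    "PTPN11": 0.075,
    "EZH2": 0.06,
    "SETBP1": 0.02,
}
n_harmony = 12_041

# Under independence, the joint probability is the product of the marginals.
joint_probability = math.prod(harmony_gene_freq.values())
expected_quintuples = n_harmony * joint_probability

print(f"joint probability={joint_probability:.3g}")
print(f"expected gene-level quintuples={expected_quintuples:.3f}")
```

Independence is the working assumption here; co-occurrence or mutual exclusivity between these genes would shift the expected count up or down.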
Posterior Predictive: Patients Needed
Using the Jeffreys posterior, the probability of observing at least one quintuple match
in a future cohort of M patients is computed via the posterior predictive distribution.
A future cohort of 55,875 patients is needed for a 50% chance of finding at least one match,
and 7,431,474 patients for a 95% chance. Both figures exceed the current combined
database size, so the zero-match observation is consistent
with expectations rather than a sampling artifact.
| Future Cohort Size | P(at least 1 match) |
|---|---|
| 1,000 | 2.6% |
| 5,000 | 11.2% |
| 10,000 | 19.3% |
| 50,000 | 47.9% |
| 100,000 | 60.4% |
| 500,000 | 81.0% |
| 1,000,000 | 86.5% |
50% detection: 55,875 patients (future cohort needed for 50% chance of 1 match)
95% detection: 7,431,474 patients (future cohort needed for 95% chance of 1 match)
Source: bayesian_rarity_results.json.
Posterior predictive under Jeffreys conjugate Beta prior.
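Under a Beta(a, b) posterior, the probability of zero matches in M future patients is the Beta function ratio B(a, b + M) / B(a, b), which reduces to a ratio of gamma functions. The sketch below evaluates it with log-gamma and bisects for the cohort size that reaches a target probability. It assumes the Beta(0.5, 18,625.5) posterior from above; the function names are hypothetical, and this is not necessarily how the source pipeline computed the table.

```python
from math import expm1, lgamma


def prob_at_least_one(m, n=18_625, a=0.5):
    """P(at least 1 match in m future patients) after 0 matches in n patients."""
    b = n + 0.5  # second shape parameter of the Jeffreys posterior
    # log of B(a, b + m) / B(a, b), written with log-gamma for stability
    log_p0 = lgamma(b + m) + lgamma(a + b) - lgamma(a + b + m) - lgamma(b)
    return -expm1(log_p0)


def cohort_for(target, n=18_625):
    """Smallest cohort size whose detection probability reaches `target` (bisection)."""
    lo, hi = 0, 1
    while prob_at_least_one(hi, n) < target:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prob_at_least_one(mid, n) >= target:
            hi = mid
        else:
            lo = mid
    return hi


for m in (1_000, 5_000, 10_000, 50_000, 100_000, 500_000, 1_000_000):
    print(f"{m:>9,}: {prob_at_least_one(m):.1%}")

m50 = cohort_for(0.50)
m95 = cohort_for(0.95)
print(f"50%: {m50:,} patients; 95%: {m95:,} patients")
```

For large n the expression is close to 1 - sqrt(n / (n + M)), which explains why the 50% cohort is roughly three times the observed cohort and why the curve flattens so slowly toward 95%. The bisection results should land within rounding of the cohort sizes reported above.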
Extreme Value Theory
All 278,256 possible five-gene combinations from the 34 target myeloid genes were
enumerated and their expected counts computed under independence. A Generalized
Extreme Value (GEV) distribution was fitted to the log-transformed expected counts.
The patient quintuple ranks 216,373 out of 278,256 at the gene level (77.8th
percentile), indicating that these five genes are individually common enough in myeloid
malignancies that their gene-level co-occurrence is unremarkable. The rarity originates
from the specific variant combination, not the gene combination.
| EVT Metric | Value | Interpretation |
|---|---|---|
| Gene-level percentile | 77.8th | Genes common, combination not extreme |
| Gene-level expected count | 1.94e-2 | Expected quintuples at gene level |
| Variant specificity product | 7.93e-8 | Product of variant-to-gene ratios |
| Variant-level expected count | 1.54e-9 | Gene-level times specificity product |
| Rarity gap | 7.1 orders of magnitude | Variants 12.6 million times rarer than genes |
| GEV shape parameter | 0.2278 | Positive shape: heavy-tailed distribution |
The GEV fit indicates a heavy-tailed distribution (positive shape parameter 0.23),
meaning a small number of gene combinations have disproportionately high expected
counts (e.g., DNMT3A+TET2+ASXL1 combinations) while the vast majority are rare.
The patient quintuple sits in the body of this distribution at the gene level,
but falls in the extreme tail at the variant level.
Source: bayesian_rarity_results.json.
Coles S. Statistical Modeling of Extreme Values. Springer (2001).
Babur et al. Genome Biol (2015). Canisius et al. Genome Biol (2016).
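The enumeration step can be sketched as follows. The combination count and percentile use the figures quoted above; the `rank_combinations` helper and the toy gene frequencies are hypothetical placeholders, since the per-gene frequencies for the 34-gene panel are not listed on this page. The direction of the ranking (whether rank 1 is the most or the least common combination) is not stated in the source, so the percentile is computed simply as rank divided by total.

```python
from itertools import combinations
from math import comb, prod


def rank_combinations(gene_freqs, n_patients, k=5):
    """Expected count under independence for every k-gene combination, highest first."""
    rows = []
    for genes in combinations(sorted(gene_freqs), k):
        expected = n_patients * prod(gene_freqs[g] for g in genes)
        rows.append((genes, expected))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Figures quoted in the text.
n_combinations = comb(34, 5)
gene_level_percentile = 100 * 216_373 / n_combinations
print(f"{n_combinations:,} combinations; percentile={gene_level_percentile:.1f}")

# Toy input: hypothetical genes and frequencies, for illustration only.
toy_freqs = {"GENE_A": 0.30, "GENE_B": 0.20, "GENE_C": 0.10,
             "GENE_D": 0.05, "GENE_E": 0.02, "GENE_F": 0.01}
toy_ranking = rank_combinations(toy_freqs, n_patients=1_000, k=5)
for genes, expected in toy_ranking[:3]:
    print("+".join(genes), f"{expected:.3g}")
```

With 34 genes the same function yields 278,256 rows, a size that is easily enumerated exhaustively before fitting a distribution to the log-transformed expected counts.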
References
- AACR Project GENIE Consortium. AACR Project GENIE: Powering Precision Medicine through an International Consortium. Cancer Discov (2017).
- Coles S. An Introduction to Statistical Modeling of Extreme Values. Springer Series in Statistics (2001).
- Babur O, Gonen M, Aksoy BA, et al. Systematic identification of cancer driving signaling pathways based on mutual exclusivity of genomic alterations. Genome Biol (2015).
- Canisius S, Martens JW, Wessels LF. A novel independence test for somatic alterations in cancer. Genome Biol (2016).
- Hanley JA, Lippman-Hand A. If nothing goes wrong, is everything all right? Interpreting zero numerators. JAMA (1983).