--- title: "Statistics deep dive (Jaccard, Dice, hypergeometric, BH-FDR)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Statistics deep dive (Jaccard, Dice, hypergeometric, BH-FDR)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") # Skip evaluation of all chunks on CRAN's auto-check farm to fit the # 10-minute build budget. Locally, on CI, and under devtools::check(), # NOT_CRAN=true and all chunks evaluate normally. The vignette source # (which CRAN users see in browseVignettes() / vignette()) is unchanged. NOT_CRAN <- identical(tolower(Sys.getenv("NOT_CRAN")), "true") knitr::opts_chunk$set(eval = NOT_CRAN) ``` # Statistics deep dive `vennDiagramLab` reports five pairwise metrics for every set pair plus a multiple-testing correction. This vignette explains what each metric means, when to prefer it, and how to reproduce the values that appear in the web tool's significance coloring. ```{r load} library(vennDiagramLab) result <- analyze(load_sample("dataset_real_cancer_drivers_4")) stats <- statistics(result) ``` ## The five metrics For two sets `A` and `B` of sizes `|A|`, `|B|` with intersection `|A ∩ B|` drawn from a universe of `N` items: * **Jaccard** = `|A ∩ B| / |A ∪ B|`. Range `[0, 1]`. Symmetric. * **Dice** = `2 |A ∩ B| / (|A| + |B|)`. Range `[0, 1]`. Symmetric. Always `>= Jaccard`; the two relate by `Dice = 2J / (1 + J)`. * **Overlap coefficient** = `|A ∩ B| / min(|A|, |B|)`. Range `[0, 1]`. Equal to 1 when one set is contained in the other. * **Hypergeometric p-value** — probability of observing `|A ∩ B|` or more shared items by chance, given `|A|`, `|B|`, and `N`. Tests over- representation. * **Fold enrichment** = `(|A ∩ B| * N) / (|A| * |B|)`. The ratio of observed to expected intersection size under independence. `> 1` is over- representation. ## Compute one metric directly The helpers are exported and stateless: ```{r helpers} jaccard(size_a = 138, size_b = 581, intersection = 100) dice(size_a = 138, size_b = 581, intersection = 100) overlap_coefficient(size_a = 138, size_b = 581, intersection = 100) hypergeometric_p_value(N = 20000, K = 138, n = 581, k = 100) fold_enrichment(N = 20000, K = 138, n = 581, k = 100) ``` The hypergeometric p-value is essentially zero: 100 shared genes out of an expected `(138 * 581) / 20000 ≈ 4` is a 25× enrichment. ## All pairs at once `statistics(result)` returns five tables (four square `NxN` matrices for the ratio metrics + a long-form data.frame for the hypergeometric test): ```{r stats-tables} stats@jaccard ``` ```{r stats-hyp} head(stats@hypergeometric) ``` ## BH-FDR adjustment `stats@hypergeometric` already carries the BH-FDR-adjusted q-value (`p_adjusted`) and a boolean `significant` (q < 0.05) and `highly_significant` (q < 0.001). The adjustment uses `stats::p.adjust(method = "BH")`: ```{r bh-fdr} raw_p <- stats@hypergeometric$p_value adjusted <- bh_fdr(raw_p) all.equal(adjusted, stats@hypergeometric$p_adjusted) ``` For unrelated p-values, BH-FDR is more permissive than Bonferroni and more conservative than no correction: ```{r bh-fdr-toy} toy_p <- c(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 0.9) data.frame( raw = toy_p, bonferroni = pmin(toy_p * length(toy_p), 1), bh_fdr = bh_fdr(toy_p) ) ``` ## broom-compatible long-form `broom::tidy()` produces a tibble that's pipeline-friendly (one row per pair, all metrics in one frame, sorted by adjusted p-value): ```{r tidy} broom::tidy(result) ``` ## Reproducing the web tool's coloring The web tool colors significant pairs via the same `p_adjusted` thresholds: ```{r reproduce} sig_table <- broom::tidy(result) sig_table$colour <- ifelse(sig_table$highly_significant, "red", ifelse(sig_table$significant, "orange", "grey")) sig_table[, c("set_a", "set_b", "p_adjusted", "colour")] ``` ## What's next * `vignette("v02_real_cancer_drivers")` — see these stats in the context of a real biological analysis. * `vignette("v06_pipeline_integration")` — feed `broom::tidy()` into a downstream tidyverse pipeline.