--- title: "Getting started with subgxe" output: rmarkdown::pdf_document vignette: > %\VignetteIndexEntry{Getting started with subgxe} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction `subgxe` is an R package for combining summary data from multiple association studies or multiple phenotypes in a single study by incorporating potential gene-environment (G-E) interactions into the testing procedure. It is an implementation of the *p value-assisted subset testing for associations (pASTA)* framework proposed by Yu et al(2019). The goal is to identify a subset of studies or traits that yields the strongest evidence of associations and give a meta-analytic p-value. This vignette offers a brief introduction to the basic use of `subgxe`. For more details on the algorithms used by `subgxe` please refer to the paper. ## subgxe Example We use simulated data of $K=5$ independent case-control studies that come along with the package to illustrate the basic use of `subgxe`. In each data set, `G`, `E`, and `D` denote the genetic variant, environmental factor, and disease status (binary outcome), respectively. In this case, `G` is coded as binary (under a dominant or recessive susceptibility model). It can also be coded as allele count (under the additive model). Two of the 5 studies have non-null genetic associations with the true marginal genetic odds ratio being 1.09. Each study has 6,000 cases and 6,000 controls, with the total sample size $n_k$ being 12,000. For the specific underlying parameters of the data generating model, please refer to the original [article](https://doi.org/10.1159/000496867) (Table A4, Scenario 2). ```{r setup} # library(devtools) # install_github("umich-cphds/subgxe", build_opts = c()) library(subgxe) ``` We first obtain a $K\times1$ vector of input p-values by conducting association test for each study. For study $k$, $k=1,\cdots,K$, the *joint* model with G-E interaction is $$\text{logit}[E(D_{ki}|G_{ki},E_{ki})]=\beta_0^{(k)}+\beta_G^{(k)}G_{ki}+\beta_E^{(k)}E_{ki}+\beta_{GE}^{(k)}G_{ki}E_{ki}$$ where $i=1,\cdots,n_k$. The model can be further adjusted for potential confounders, which we drop from the presentation for the simplicity of notation. To detect the genetic association while accounting for the G-E interaction, one can test the null hypothesis $$\beta_{G}^{(k)}=\beta_{GE}^{(k)}=0$$ based on the joint model for each study $k$. The coefficients can be estimated by maximum likelihood using the `glm` function. For alternative null hypotheses and methods for estimation of coefficients, see the [reference](https://doi.org/10.1159/000496867) mentioned above. A common choice for testing the null hypothesis $\beta_{G}^{(k)}=\beta_{GE}^{(k)}=0$ is the likelihood ratio test (LRT) with 2 degrees of freedom, which can be carried out with the `lrtest` function in the package `lmtest`. We use the results of LRT as an example to demonstrate the use of `subgxe`. For comparative purposes, we also look at the p-values of the *marginal* genetic associations obtained by Wald test, i.e. the p-values of $\hat{\alpha}_G^{(k)}$ in the model $$\text{logit}[E(D_{ki}|G_{ki})]=\alpha_0^{(k)}+\alpha_G^{(k)}G_{ki}.$$ ```{r pvalues, message = FALSE} library(lmtest) K <- 5 # number of studies study.pvals.marg <- NULL study.pvals.joint <- NULL for(i in 1:K){ joint.model <- glm(D ~ G + E + I(G*E), data=studies[[i]], family="binomial") null.model <- glm(D ~ E, data=studies[[i]], family="binomial") marg.model <- glm(D ~ G, data=studies[[i]], family="binomial") study.pvals.marg[i] <- summary(marg.model)$coef[2,4] study.pvals.joint[i] <- lmtest::lrtest(null.model, joint.model)[2,5] } ``` Then we use the `pasta()` function in the `subgxe` package to conduct subset analysis and obtain a meta-analytic p-value for the genetic association. * The `cor` parameter is a correlation matrix of the study-specific p-values. In this example, since the studies are independent, the p-values are independent as well, and therefore the `cor` should be an identity matrix. In a *multiple-phenotype* analysis where the phenotypes are measured on the same set of subjects, one way to approximate the correlations among p-values is to use the phenotypic correlations. ```{r pasta} study.sizes <- c(nrow(studies[[1]]), nrow(studies[[2]]), nrow(studies[[3]]), nrow(studies[[4]]), nrow(studies[[5]])) cor.matrix <- diag(1, K) pasta.joint <- pasta(p.values=study.pvals.joint, study.sizes=study.sizes, cor=cor.matrix) pasta.marg <- pasta(p.values=study.pvals.marg, study.sizes=study.sizes, cor=cor.matrix) pasta.joint$p.pasta # delete 'joint' pasta.joint$test.statistic$selected.subset pasta.marg$p.pasta # delete 'joint' pasta.marg$test.statistic$selected.subset ``` From the output we observe that when the G-E interaction is taken into account, `pasta` yields a meta-analytic p-value of `r formatC(pasta.joint$p.pasta, format = "f", digits = 3)` and identifies the first two studies as non-null. On the other hand, if we only consider the marginal associations, the meta-analytic p-value becomes much larger (p=`r formatC(pasta.marg$p.pasta, format = "f", digits = 3)`) and the first three studies are identified as having significant associations. ## Reference * Yu Y, Xia L, Lee S, Zhou X, Stringham H, M, Boehnke M, Mukherjee B: Subset-Based Analysis Using Gene-Environment Interactions for Discovery of Genetic Associations across Multiple Studies or Phenotypes. *Hum Hered* 2019. doi: [10.1159/000496867](https://doi.org/10.1159/000496867)