Using GxEScanR

Introduction

GxEScanR is designed to efficiently run genome-wide association study (GWAS) and genome-wide by environment interaction study (GWEIS) scans using imputed genotypes stored in the BinaryDosage format. The phenotype to be analyzed can either be a continuous or binary trait. The GWEIS scan performs multiple tests that can be used in two-step methods.

Notation and Models

There are five models that can be fit using GxEScanR when running a GWEIS, plus one that must be fit before running the GWEIS.

\(Y\) represents the outcome or phenotype

\(X\) represents the confounding covariates

\(E\) represents the covariate that will have its genetic interaction tested

\(G\) represents the genetic data

\(\beta\) represents the effect estimate that will be fitted in the model

Model 0 - Environment-only Model

This model contains the phenotype, confounding covariates, and the interaction covariate. It must be fit prior to running the GWEIS using the glm() routine.

\[Y = \beta_X X + \beta_E E\]

Model 1 - Gene-only Model

This model contains all the values from the environment-only model and adds the genetic data.

\[Y = \beta_X X + \beta_E E + \beta_G G\]

\(\beta_G\) is the only term tested in this model and is represented by bg_ge in the output.

Model 2 - GxE Interaction Model

This model contains all the values from the gene-only model and adds the gene-environment interaction term.

\[Y = \beta_X X + \beta_E E + \beta_G G + \beta_{GxE} GE\]

\(\beta_G\) and \(\beta_{GxE}\) are both tested for this model as well as a joint test for both terms. They are represented by bg_gxe, bgxe, and joint, respectively, in the output.

Model 3 - EG Model

This model tests for a relationship between the environmental value of interest and the gene. If \(E\) is coded 0,1, a logistic model is fitted; otherwise a linear model is fitted.

\[E = \beta_X X + \beta_G G\]

\(\beta_G\) is tested in this model and is represented by bg_eg in the output.

Model 4 - Case-only Model

This model is the same as model 3 except it is run on cases only. \(\beta_G\) is tested and is represented by bg_case in the output. This model can only be run on a dichotomous phenotype.

Model 5 - Control-only Model

This model is the same as model 3 except it is run on controls only. \(\beta_G\) is tested and is represented by bg_ctrl in the output. This model can only be run on a dichotomous phenotype.

Steps to Run a GWEIS

There are five steps to running a GWEIS using GxEScanR:

  1. Preparing the Data (Genetic Data + Subject Data)
  2. Fitting the Base Model
  3. Allocating Memory for the GWEIS
  4. Running the GWEIS
  5. Processing the Results

Step 1: Preparing the Data

Genetic Data

GxEScanR uses data stored in the BinaryDosage format. The BinaryDosage package converts files from the VCF format to the BinaryDosage format using the vcftobd() routine. The getbdinfo() routine is then used to load the information about the BinaryDosage file needed to run the GWEIS.

The following example shows how to convert a VCF file to BinaryDosage format. Note that vcftobd() requires the vcfppR package.

vcffile <- system.file("extdata", "gendata.vcf.gz", package = "GxEScanR")

exampledir <- tempdir()
bdosefile <- file.path(exampledir, "gendata.bdose")

BinaryDosage::vcftobd(vcffile = vcffile, bdose_file = bdosefile)
bdinfo <- BinaryDosage::getbdinfo(bdosefile)

This vignette uses a pre-converted BinaryDosage file included with the package.

bdosefile <- system.file("extdata", "gendata.bdose", package = "GxEScanR")
bdinfo <- BinaryDosage::getbdinfo(bdosefile)
exampledir <- tempdir()

Subject Data

The subject data consists of subject IDs used to link the genetic data, and the phenotype and covariates used in the GWEIS. The data included with GxEScanR has two phenotypes (one continuous, one dichotomous) and two covariates (x1 and x2).

Important: All values for phenotypes and covariates must be numeric; factors are not allowed.

Important: All subject IDs must be character values.

Important: When using a dichotomous phenotype, it must be coded 0,1.

subjectfile <- system.file("extdata", "subdata.rds", package = "GxEScanR")
subjectdata <- readRDS(subjectfile)
head(subjectdata)
#>      subid y_linear y_logistic x1 x2
#> 1 subject1 7.939120          1  0  0
#> 2 subject2 5.965786          0  1  0
#> 3 subject3 3.047969          0  1  0
#> 4 subject4 1.173026          0  0  0
#> 5 subject5 7.691592          1  1  0
#> 6 subject6 3.931234          0  1  0

Step 2: Fitting the Base Model

The base model is fit using glm() using data that contains only the subject IDs, phenotype, and covariates of interest — no genetic data.

It is important to:

The formula passed to glm() must list the covariate whose interaction with the genetic data is being tested last. In this example we test the interaction with x1, so the formula is y ~ x2 + x1.

Continuous Phenotype

# Remove y_logistic from the subject data
lineardata <- subjectdata[, c(1, 2, 4, 5)]

# Keep only subjects with complete data
lineardata <- lineardata[complete.cases(lineardata), ]

# Keep only subjects with genetic data
lineardata <- lineardata[lineardata$subid %in% bdinfo$samples$sid, ]

# Fit the linear model
linearmodel <- glm(y_linear ~ x2 + x1, data = lineardata)

Dichotomous Phenotype

# Remove y_linear from the subject data
logisticdata <- subjectdata[, c(1, 3, 4, 5)]

# Keep only subjects with complete data
logisticdata <- logisticdata[complete.cases(logisticdata), ]

# Keep only subjects with genetic data
logisticdata <- logisticdata[logisticdata$subid %in% bdinfo$samples$sid, ]

# Fit the logistic model
logisticmodel <- glm(y_logistic ~ x2 + x1, family = binomial, data = logisticdata)

Step 3: Allocating Memory for the GWEIS

Memory must be allocated before running the GWEIS using gweis.mem(). This pre-computation makes the scan substantially faster. Pass the fitted base model, the subject IDs from that model’s data, and the list of tests to perform.

Typical test sets:

# Continuous outcome
linearmem <- gweis.mem(gemdl = linearmodel,
                       subids = lineardata$subid,
                       tests = c("bg_ge", "bg_gxe", "bgxe", "joint"))

# Dichotomous outcome
logisticmem <- gweis.mem(gemdl = logisticmodel,
                         subids = logisticdata$subid,
                         tests = c("bg_ge", "bg_gxe", "bgxe", "joint",
                                   "bg_eg", "bg_case", "bg_ctrl"))

Step 4: Running the GWEIS

Pass the memory object from gweis.mem(), the bdinfo object, a vector of SNP indices or IDs to test, and an output file path to rungweis(). Results are written as a tab-delimited text file.

snpindex <- 1:nrow(bdinfo$snps)

# Continuous outcome
linearresults <- file.path(exampledir, "linear.txt")
rungweis(gweismem = linearmem,
         bdinfo = bdinfo,
         snps = snpindex,
         outfilename = linearresults)

# Dichotomous outcome
logisticresults <- file.path(exampledir, "logistic.txt")
rungweis(gweismem = logisticmem,
         bdinfo = bdinfo,
         snps = snpindex,
         outfilename = logisticresults)

Step 5: Processing the Results

Read the output file with read.table(). The following describes the output columns, illustrated with the first three SNPs.

lineardf <- read.table(linearresults, header = TRUE, sep = "\t")
logisticdf <- read.table(logisticresults, header = TRUE, sep = "\t")

SNP Information and Allele Frequency

Every row contains SNP ID (snpid), chromosome (chr), position (loc), reference allele (ref), alternate allele (alt), and alternate allele frequency (aaf). For a dichotomous outcome, aaf_case and aaf_ctrl are also included.

knitr::kable(lineardf[1:3, 1:6], caption = "SNP information — continuous outcome")
SNP information — continuous outcome
snpid chr loc ref alt aaf
chr22:10527916:T:G chr22 10527916 T G 0.6494005
chr22:10648779:G:T chr22 10648779 G T 0.0854975
chr22:10648794:G:A chr22 10648794 G A 0.0536727
knitr::kable(logisticdf[1:3, 1:8], caption = "SNP information — dichotomous outcome")
SNP information — dichotomous outcome
snpid chr loc ref alt aaf aaf_case aaf_ctrl
chr22:10527916:T:G chr22 10527916 T G 0.6494005 0.6264272 0.6540861
chr22:10648779:G:T chr22 10648779 G T 0.0854975 0.0882864 0.0849287
chr22:10648794:G:A chr22 10648794 G A 0.0536727 0.0457573 0.0552871

Model 1 Output: bg_ge

bg_ge is the \(\beta_G\) estimate from model 1. bg_ge_lrt is the LRT 1 df chi-squared statistic for \(\beta_G = 0\). Output when "bg_ge" is in tests.

knitr::kable(lineardf[1:3, 7:8], caption = "Model 1 results")
Model 1 results
bg_ge bg_ge_lrt
-0.8900621 5.2558824
-0.4766408 0.6737040
-0.6481754 0.8070534

Model 2 Output: bg_gxe, bgxe, joint

knitr::kable(lineardf[1:3, 9:13], caption = "Model 2 results")
Model 2 results
bg_gxe bg_gxe_lrt bgxe bgxe_lrt joint_lrt
-1.4864649 8.4363261 1.4016489 3.2015268 8.4574092
-0.4591966 0.3719000 -0.0430555 0.0013243 0.6750284
-0.6509083 0.5512644 0.0084781 0.0000301 0.8070835

Models 3–5 Output: bg_eg, bg_case, bg_ctrl

Each test produces a \(\beta_G\) estimate and a 1 df LRT statistic. Output when "bg_eg", "bg_case", or "bg_ctrl" are in tests, respectively.

knitr::kable(logisticdf[1:3, 16:21], caption = "Models 3-5 results")
Models 3-5 results
bg_eg bg_eg_lrt bg_case bg_case_lrt bg_ctrl bg_ctrl_lrt
-0.3460221 0.7862570 0.8264053 0.8541469 -0.4673483 1.1215720
-0.0867130 0.0219140 -0.7545895 0.4064202 0.0815513 0.0144966
-0.2255839 0.0949399 -0.4952261 0.0361789 -0.0220965 0.0008210