Michigan Imputation Server
Michigan Imputation Server pre-phases typed genotypes using HAPI-UR, SHAPEIT, or EAGLE (default is EAGLE2), imputes using Minimac3 imputation engine and outputs Blocked GNU Zip Format VCF files (.vcf.gz). These .vcf.gz files are used as input for gwasurvivr. Minimac uses slightly different metrics to assess imputation quality (\(R^2\) versus info score) and complete details as to minimac output are available on the Minimac3 Wikipage.
The function, michiganCoxSurv uses a modification of Cox proportional hazard regression from the R library survival:::coxph.fit. Built specifically for genetic data, michiganCoxSurv allows the user to filter on info (\(R^2\)) score (imputation quality metric) and minor allele frequency from the reference panel used for imputation using RefPanelAF as the input arguement for maf.filter. Users are also provided with the sample minor allele frequency (MAF) in the sangerCoxSurv output, which can be used for filtering post analysis.
Samples can be selected for analyses by providing a vector of sample.ids. The output from Sanger imputation server returns the samples as SAMP1, ..., SAMPN, where N is the total number of samples. The sample order corresponds to the sample order in the vcf.file used for imputation. Note, sample order can also be found in the .fam file if genotyping data were initially in .bed, .bim and .fam (PLINK) format prior to conversion to VCF. If no sample list is specified all samples are included in the analyses.
vcf.file <- system.file(package="gwasurvivr",
                        "extdata", 
                        "michigan.chr14.dose.vcf.gz")
pheno.fl <- system.file(package="gwasurvivr",
                        "extdata", 
                        "simulated_pheno.txt")
pheno.file <- read.table(pheno.fl,
                         sep=" ", 
                         header=TRUE,
                         stringsAsFactors = FALSE)
head(pheno.file)
##   ID_1  ID_2 event  time   age DrugTxYes    sex        group
## 1    1 SAMP1     0 12.00 33.93         0   male      control
## 2    2 SAMP2     1  7.61 58.71         1   male experimental
## 3    3 SAMP3     0 12.00 39.38         0 female      control
## 4    4 SAMP4     0  4.30 38.85         0   male      control
## 5    5 SAMP5     0 12.00 43.58         0   male experimental
## 6    6 SAMP6     1  2.60 57.74         0   male      control
# recode sex column and remove first column 
pheno.file$SexFemale <- ifelse(pheno.file$sex=="female", 1L, 0L)
# select only experimental group sample.ids
sample.ids <- pheno.file[pheno.file$group=="experimental",]$ID_2
head(sample.ids)
## [1] "SAMP2"  "SAMP5"  "SAMP7"  "SAMP9"  "SAMP11" "SAMP12"
In this example, we will select samples from the experimental group and will test survival only on these patients. The first column in the pheno.file are sample IDs, which link the phenotype file to the imputation file. We include age, DrugTxYes, and sex in the survival model as covariates.
We perform the analysis using the experimental group to demonstrate how one may want to prepare their data if interested in testing only on a subset of samples (i.e. a case-control study and survival of cases is of interest). Note that how the IDs (sample.ids) need to be a vector of class character. The chunk.size refers to size of each data chunk read in and is defaulted to 10,000 rows. Users can customize that to their needs. The larger the chunk.size the more memory (RAM) required to run the analysis. The recommended chunk.size=10000 and probably should not exceed chunk.size=100000. This does not mean that gwasurvivr is limited to only 100,000 SNPs, it just is how many SNPs are analyzed for each iteration.
By default survival estimates and pvalues for the SNP adjusted for other covariates are outputted (print.covs='only'), however users can select print.covs=all to get the coefficient estimates for covariates included in the model. Depending on the number of covariates included this can add substantially to output file size.
 Single SNP analysis
Next we run michiganCoxSurv with the default, print.covs="only", load the results into R and provide descriptions of output by column. We will then run the analysis again using print.covs="all". verbose=TRUE is used for these examples so the function display messages while running.
Use ?michiganCoxSurv for argument specific documentation.
print.covs="only"
michiganCoxSurv(vcf.file=vcf.file,
                covariate.file=pheno.file,
                id.column="ID_2",
                sample.ids=sample.ids,
                time.to.event="time",
                event="event",
                covariates=c("age", "SexFemale", "DrugTxYes"),
                inter.term=NULL,
                print.covs="only",
                out.file="michigan_only",
                r2.filter=0.3,
                maf.filter=0.005,
                chunk.size=100,
                verbose=TRUE,
                clusterObj=NULL)
## Analysis started on 2024-05-01 at 00:07:31
## Covariates included in the models are: age, DrugTxYes, SexFemale
## 52 samples are included in the analysis
## Analyzing chunk 0-100
## Analyzing chunk 100-200
## Analysis completed on 2024-05-01 at 00:07:47
## 0 SNPs were removed from the analysis for not meeting the threshold criteria.
## List of removed SNPs can be found in /tmp/Rtmp9Suy1w/michigan_only32b811747ce9bd.snps_removed
## 3 SNPs were analyzed in total
## The survival output can be found at /tmp/Rtmp9Suy1w/michigan_only32b811747ce9bd.coxph
Here we load the data and glimpse the first few values in each column outputted from the SNP*interaction survival analyses using print.covs="only".
michigan_only <- read.table("michigan_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(michigan_only))
## 'data.frame':    3 obs. of  21 variables:
##  $ RSID         : chr  "rs34919020" "rs8005305" "rs757545375"
##  $ TYPED        : logi  FALSE FALSE FALSE
##  $ CHR          : int  14 14 14
##  $ POS          : int  19459185 20095842 20097287
##  $ REF          : chr  "C" "G" "A"
##  $ ALT          : chr  "T" "T" "G"
##  $ AF           : num  0.301 0.515 0.52
##  $ MAF          : num  0.301 0.485 0.48
##  $ SAMP_FREQ_ALT: num  0.355 0.517 0.52
##  $ SAMP_MAF     : num  0.355 0.483 0.48
##  $ R2           : num  0.552 0.479 0.481
##  $ ER2          : logi  NA NA NA
##  $ PVALUE       : num  0.713 0.805 0.798
##  $ HR           : num  1.224 0.885 0.881
##  $ HR_lowerCI   : num  0.417 0.333 0.332
##  $ HR_upperCI   : num  3.6 2.35 2.34
##  $ Z            : num  0.368 -0.246 -0.255
##  $ COEF         : num  0.203 -0.123 -0.127
##  $ SE.COEF      : num  0.55 0.498 0.498
##  $ N            : int  52 52 52
##  $ N.EVENT      : int  21 21 21
 
 SNP with covariate interaction
A SNP*covariate interaction can be implemented using the inter.term argument. In this example, we will use DrugTxYes from the covariate file as the covariate we want to test for interaction with the SNP.
print.covs="only"
michiganCoxSurv(vcf.file=vcf.file,
                covariate.file=pheno.file,
                id.column="ID_2",
                sample.ids=sample.ids,
                time.to.event="time",
                event="event",
                covariates=c("age", "SexFemale", "DrugTxYes"),
                inter.term="DrugTxYes",
                print.covs="only",
                out.file="michigan_intx_only",
                r2.filter=0.3,
                maf.filter=0.005,
                chunk.size=100,
                verbose=FALSE,
                clusterObj=NULL)
Here we load the data and glimpse the first few values in each column outputted from the SNP*interaction survival analyses using print.covs="only".
michigan_intx_only <- read.table("michigan_intx_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(michigan_intx_only))
## 'data.frame':    3 obs. of  21 variables:
##  $ RSID         : chr  "rs34919020" "rs8005305" "rs757545375"
##  $ TYPED        : logi  FALSE FALSE FALSE
##  $ CHR          : int  14 14 14
##  $ POS          : int  19459185 20095842 20097287
##  $ REF          : chr  "C" "G" "A"
##  $ ALT          : chr  "T" "T" "G"
##  $ AF           : num  0.301 0.515 0.52
##  $ MAF          : num  0.301 0.485 0.48
##  $ SAMP_FREQ_ALT: num  0.355 0.517 0.52
##  $ SAMP_MAF     : num  0.355 0.483 0.48
##  $ R2           : num  0.552 0.479 0.481
##  $ ER2          : logi  NA NA NA
##  $ PVALUE       : num  0.9017 0.0806 0.0786
##  $ HR           : num  1.15 7.03 7.11
##  $ HR_lowerCI   : num  0.127 0.789 0.799
##  $ HR_upperCI   : num  10.4 62.7 63.2
##  $ Z            : num  0.124 1.747 1.759
##  $ COEF         : num  0.139 1.951 1.961
##  $ SE.COEF      : num  1.12 1.12 1.11
##  $ N            : int  52 52 52
##  $ N.EVENT      : int  21 21 21
 
 
 Sanger Imputation Server
Sanger Imputation Server pre-phases typed genotypes using either SHAPEIT or EAGLE, imputes genotypes using PBWT algorithm and outputs a .vcf.gz file for each chromosome. These .vcf.gz files are used as input for gwasurvivr. The function, sangerCoxSurv uses a modification of Cox proportional hazard regression from survival::coxph.fit. Built specifically for genetic data, sangerCoxSurv allows the user to filter on info score (imputation quality metric) and minor allele frequency from the reference panel used for imputation using RefPanelAF as the input arguement for maf.filter. Users are also provided with the sample minor allele frequency in the sangerCoxSurv output.
Samples can be selected for analyses by providing a vector of sample.ids. The output from Sanger imputation server returns the samples as SAMP1, ..., SAMPN, where N is the total number of samples. The sample order corresponds to the sample order in the VCF file used for imputation. Note, sample order can also be found in the .fam file if genotyping data were initially in .bed, .bim and .fam (PLINK) format prior to conversion to VCF. If no sample list is specified, all samples are included in the analyses.
In this example, we will select samples from the experimental group and will test survival only on these patients. The first column in the pheno.file are sample IDs (we will match on these). We include age, DrugTxYes, and sex in the survival model as covariates.
vcf.file <- system.file(package="gwasurvivr",
                        "extdata", 
                        "sanger.pbwt_reference_impute.vcf.gz")
pheno.fl <- system.file(package="gwasurvivr",
                        "extdata", 
                        "simulated_pheno.txt")
pheno.file <- read.table(pheno.fl,
                         sep=" ",
                         header=TRUE,
                         stringsAsFactors = FALSE)
head(pheno.file)
##   ID_1  ID_2 event  time   age DrugTxYes    sex        group
## 1    1 SAMP1     0 12.00 33.93         0   male      control
## 2    2 SAMP2     1  7.61 58.71         1   male experimental
## 3    3 SAMP3     0 12.00 39.38         0 female      control
## 4    4 SAMP4     0  4.30 38.85         0   male      control
## 5    5 SAMP5     0 12.00 43.58         0   male experimental
## 6    6 SAMP6     1  2.60 57.74         0   male      control
# recode sex column and remove first column 
pheno.file$SexFemale <- ifelse(pheno.file$sex=="female", 1L, 0L)
# select only experimental group sample.ids
sample.ids <- pheno.file[pheno.file$group=="experimental",]$ID_2
head(sample.ids)
## [1] "SAMP2"  "SAMP5"  "SAMP7"  "SAMP9"  "SAMP11" "SAMP12"
We perform the analysis using the experimental group to demonstrate how one may want to prepare their data if not initially all samples are patients or cases (i.e. a case-control study and survival of cases is of interest). We also are showing how the IDs (sample.ids) need to be a vector of class character. The chunk.size refers to size of each data chunk read in and is defaulted to 10,000 rows, users can customize that to their needs. The larger the chunk.size the more memory (RAM) required to run the analysis. The recommended chunk.size=10000 and probably should not exceed chunk.size=100000. This does not mean that gwasurvivr is limited to only 100,000 SNPs, it just is how many SNPs are analyzed for each iteration.
By default survival estimates and pvalues for the SNP adjusted for other covariates are outputted (print.covs='only'), however users can select print.covs=all to get the coefficient estimates for covariates included in the model. Depending on the number of covariates included this can add substantially to output file size.
Use ?sangerCoxSurv for argument specific documentation.
 Single SNP analysis
Next we run sangerCoxSurv with the default, print.covs="only", load the results into R and provide descriptions of output by column. verbose=TRUE is used for these examples so the function display messages while running.
print.covs="only"
sangerCoxSurv(vcf.file=vcf.file,
              covariate.file=pheno.file,
              id.column="ID_2",
              sample.ids=sample.ids,
              time.to.event="time",
              event="event",
              covariates=c("age", "SexFemale", "DrugTxYes"),
              inter.term=NULL,
              print.covs="only",
              out.file="sanger_only",
              info.filter=0.3,
              maf.filter=0.005,
              chunk.size=100,
              verbose=TRUE,
              clusterObj=NULL)
## Analysis started on 2024-05-01 at 00:08:06
## Covariates included in the models are: age, DrugTxYes, SexFemale
## 52 samples are included in the analysis
## Analyzing chunk 0-100
## Analyzing chunk 100-200
## Analysis completed on 2024-05-01 at 00:08:22
## 0 SNPs were removed from the analysis for not meeting the threshold criteria.
## List of removed SNPs can be found in /tmp/Rtmp9Suy1w/sanger_only32b8117cfa72ea.snps_removed
## 3 SNPs were analyzed in total
## The survival output can be found at /tmp/Rtmp9Suy1w/sanger_only32b8117cfa72ea.coxph
Here we load the data and glimpse the first few values in each column from the survival analyses.
str(head(sanger_only))
## 'data.frame':    3 obs. of  19 variables:
##  $ RSID         : chr  "rs34919020" "rs8005305" "rs757545375"
##  $ TYPED        : logi  FALSE FALSE FALSE
##  $ CHR          : int  14 14 14
##  $ POS          : int  19459185 20095842 20097287
##  $ REF          : chr  "C" "G" "A"
##  $ ALT          : chr  "T" "T" "G"
##  $ RefPanelAF   : num  0.301 0.515 0.52
##  $ SAMP_FREQ_ALT: num  0.355 0.517 0.52
##  $ SAMP_MAF     : num  0.355 0.483 0.48
##  $ INFO         : num  0.552 0.479 0.481
##  $ PVALUE       : num  0.713 0.805 0.798
##  $ HR           : num  1.224 0.885 0.881
##  $ HR_lowerCI   : num  0.417 0.333 0.332
##  $ HR_upperCI   : num  3.6 2.35 2.34
##  $ Z            : num  0.368 -0.246 -0.255
##  $ COEF         : num  0.203 -0.123 -0.127
##  $ SE.COEF      : num  0.55 0.498 0.498
##  $ N            : int  52 52 52
##  $ N.EVENT      : int  21 21 21
Column names with descriptions from the survival analyses with covariates, specifying the default print.covs="only"
 
 SNP with covariate interaction
A SNP*covariate interaction can be implemented using the inter.term argument. In this example, we will use DrugTxYes from the covariate file as the covariate we want to test for interaction with the SNP.
print.covs="only"
sangerCoxSurv(vcf.file=vcf.file,
              covariate.file=pheno.file,
              id.column="ID_2",
              sample.ids=sample.ids,
              time.to.event="time",
              event="event",
              covariates=c("age", "SexFemale", "DrugTxYes"),
              inter.term="DrugTxYes",
              print.covs="only",
              out.file="sanger_intx_only",
              info.filter=0.3,
              maf.filter=0.005,
              chunk.size=100,
              verbose=TRUE,
              clusterObj=NULL)
 
 
 IMPUTE2 Imputation
IMPUTE2 is a genotype imputation and haplotype phasing program (Howie et al 2009). IMPUTE2 outputs 6 files for each chromosome chunk imputed (usually 5 MB in size). Only 2 of these files are required for analyses using gwasurvivr:
- Genotype file (.impute)
 
- Sample file (.sample)
More information can be read about these formats
We are going to load in and pre-process the genetic data and the phenotypic data (covariate.file).
impute.file <- system.file(package="gwasurvivr",
                            "extdata",
                            "impute_example.impute2.gz")
sample.file <- system.file(package="gwasurvivr",
                           "extdata", 
                           "impute_example.impute2_sample")
covariate.file <- system.file(package="gwasurvivr", 
                              "extdata",
                              "simulated_pheno.txt")
covariate.file <- read.table(covariate.file, 
                             sep=" ",
                             header=TRUE,
                             stringsAsFactors = FALSE)
covariate.file$SexFemale <- ifelse(covariate.file$sex=="female", 1L, 0L)
sample.ids <- covariate.file[covariate.file$group=="experimental",]$ID_2
To perform survival analysis using IMPUTE2 the function arguments are very similarto michiganCoxSurv and sangerCoxSurv, however the function now takes a chromosome arguement. This is needed to properly annotate the file output with the chromosome that these SNPs are in. This is purely an artifact of IMPUTE2 and how we leverage GWASTools in this function.
 Single SNP analysis
First we will do the analysis with no interaction term, followed by doing the analysis with the interaction term. The recommended output setting for single SNP analysis is print.cov="only".
impute2CoxSurv(impute.file=impute.file,
               sample.file=sample.file,
               chr=14,
               covariate.file=covariate.file,
               id.column="ID_2",
               sample.ids=sample.ids,
               time.to.event="time",
               event="event",
               covariates=c("age", "SexFemale", "DrugTxYes"),
               inter.term=NULL,
               print.covs="only",
               out.file="impute_example_only",
               chunk.size=100,
               maf.filter=0.005,
               exclude.snps=NULL,
               flip.dosage=TRUE,
               verbose=TRUE,
               clusterObj=NULL)
## Covariates included in the models are: age, DrugTxYes, SexFemale
## 52 samples are included in the analysis
## Analysis started on 2024-05-01 at 00:08:37
## Determining number of SNPs and samples...
## Including all SNPs.
## scan.df not given. Assigning scanIDs automatically.
## Reading sample file...
## Reading genotype file...
## Block 1 of 1
## Writing annotation...
## Compressing...
## Clean up the fragments of GDS file:
##     open the file '/tmp/Rtmp9Suy1w/32b8115bb60884.gds' (3.3K)
##     # of fragments: 30
##     save to '/tmp/Rtmp9Suy1w/32b8115bb60884.gds.tmp'
##     rename '/tmp/Rtmp9Suy1w/32b8115bb60884.gds.tmp' (2.4K, reduced: 987B)
##     # of fragments: 14
## ***** Compression time ******
## User:0.061
## System: 0.023
## Elapsed: 0.085
## *****************************
## Analyzing part 1/1...
## 0 SNPs were removed from the analysis for not meeting
## the given threshold criteria or for having MAF = 0
## List of removed SNPs are saved to /tmp/Rtmp9Suy1w/impute_example_only32b811293f50be.snps_removed
## In total 3 SNPs were included in the analysis
## The Cox model results output was saved to /tmp/Rtmp9Suy1w/impute_example_only32b811293f50be.coxph
## Analysis completed on 2024-05-01 at 00:08:39
Here we load the data and glimpse the first few values in each column output.
impute2_res_only <- read.table("impute_example_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(impute2_res_only))
## 'data.frame':    3 obs. of  17 variables:
##  $ RSID       : chr  "rs34919020" "rs8005305" "rs757545375"
##  $ TYPED      : chr  "---" "---" "---"
##  $ CHR        : int  14 14 14
##  $ POS        : int  19459185 20095842 20097287
##  $ A0         : chr  "C" "G" "A"
##  $ A1         : chr  "T" "T" "G"
##  $ exp_freq_A1: num  0.355 0.517 0.52
##  $ SAMP_MAF   : num  0.355 0.483 0.48
##  $ PVALUE     : num  0.713 0.805 0.798
##  $ HR         : num  1.224 0.885 0.881
##  $ HR_lowerCI : num  0.417 0.333 0.332
##  $ HR_upperCI : num  3.6 2.35 2.34
##  $ Z          : num  0.368 -0.246 -0.255
##  $ COEF       : num  0.203 -0.123 -0.127
##  $ SE.COEF    : num  0.55 0.498 0.498
##  $ N          : int  52 52 52
##  $ N.EVENT    : int  21 21 21
 
 SNP covariate interaction
Now we will perform a SNP*covariate interaction survival analysis using impute2CoxSurv.
impute2CoxSurv(impute.file=impute.file,
               sample.file=sample.file,
               chr=14,
               covariate.file=covariate.file,
               id.column="ID_2",
               sample.ids=sample.ids,
               time.to.event="time",
               event="event",
               covariates=c("age", "SexFemale", "DrugTxYes"),
               inter.term="DrugTxYes",
               print.covs="only",
               out.file="impute_example_intx",
               chunk.size=100,
               maf.filter=0.005,
               flip.dosage=TRUE,
               verbose=FALSE,
               clusterObj=NULL,
               keepGDS=FALSE)
## Determining number of SNPs and samples...
## Including all SNPs.
## scan.df not given. Assigning scanIDs automatically.
## Reading sample file...
## Reading genotype file...
## Block 1 of 1
## Writing annotation...
## Compressing...
## Clean up the fragments of GDS file:
##     open the file '/tmp/Rtmp9Suy1w/32b811334d34e8.gds' (3.3K)
##     # of fragments: 30
##     save to '/tmp/Rtmp9Suy1w/32b811334d34e8.gds.tmp'
##     rename '/tmp/Rtmp9Suy1w/32b811334d34e8.gds.tmp' (2.4K, reduced: 987B)
##     # of fragments: 14
## ***** Compression time ******
## User:0.059
## System: 0.016
## Elapsed: 0.075
## *****************************
Here we load the data and glimpse the first few values in each column outputted from the SNP*interaction survival analyses using print.covs="only".
impute2_res_intx <- read.table("impute_example_intx.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(impute2_res_intx))
## 'data.frame':    3 obs. of  17 variables:
##  $ RSID       : chr  "rs34919020" "rs8005305" "rs757545375"
##  $ TYPED      : chr  "---" "---" "---"
##  $ CHR        : int  14 14 14
##  $ POS        : int  19459185 20095842 20097287
##  $ A0         : chr  "C" "G" "A"
##  $ A1         : chr  "T" "T" "G"
##  $ exp_freq_A1: num  0.355 0.517 0.52
##  $ SAMP_MAF   : num  0.355 0.483 0.48
##  $ PVALUE     : num  0.9017 0.0806 0.0786
##  $ HR         : num  1.15 7.03 7.11
##  $ HR_lowerCI : num  0.127 0.789 0.799
##  $ HR_upperCI : num  10.4 62.7 63.2
##  $ Z          : num  0.124 1.747 1.759
##  $ COEF       : num  0.139 1.951 1.961
##  $ SE.COEF    : num  1.12 1.12 1.11
##  $ N          : int  52 52 52
##  $ N.EVENT    : int  21 21 21