\documentclass[12pt]{article} \SweaveOpts{keep.source=TRUE, expand=FALSE} %\VignetteIndexEntry{Trio Logic Regression and genotypic TDT} %\VignetteDepends{trio} %\VignettePackage{trio} \usepackage[margin=1.25in]{geometry} \usepackage{color,colordvi} \usepackage{amsmath,amstext} %\usepackage{psfig} \usepackage{natbib} \usepackage{graphicx} \RequirePackage[colorlinks=TRUE]{hyperref} \hypersetup{linkcolor=black, menucolor=black, urlcolor=blue, citecolor=black} %\usepackage[tight]{subfigure} %\usepackage{lscape,rotating,setspace,tabularx} \usepackage{fancyhdr} \usepackage{fancyvrb} \usepackage{setspace} \usepackage{verbatim} \usepackage{qingSweave} \newcommand{\captionfonts}{\small} %\setlength{\topmargin}{-15mm} \setlength{\parskip}{1.5ex plus0.5ex minus0.2ex} %\doublespacing \parindent0em \renewcommand{\familydefault}{cmss} \renewcommand{\baselinestretch}{1.5} \begin{document} \thispagestyle{empty} \bigskip \begin{center} \vspace*{3cm} {\LARGE \bf Preparing Case-Parent Trio Data\\[-6pt] and Detecting Disease-Associated\\[-6pt] SNPs, SNP Interactions,\\[-6pt] and Gene-Environment Interactions\\[12pt] with \texttt{trio}} \vspace*{0.9cm} {\Large Holger Schwender, Qing Li, and Ingo Ruczinski} \vspace*{0.1cm} \end{center} %\doublespace \newpage \tableofcontents \newpage \section{Introduction} The \texttt{R} package \texttt{trio} contains functions for performing genotypic transmission disequilibrium tests (gTDTs) and corresponding score tests to test whether the distributions of individual SNPs \citep{schaid1996}, two-way interactions of SNPs \citep{cordell2002, cordell2004}, or interactions between SNPs and binary environmental variables differ between the cases, i.e.\ the children affected by a disease, and the matched pseudo-controls, i.e.\ the combinations of alleles that were not transmitted from the parents to their offspring, but were also possible given the parents' genotypes. Additionally, \texttt{trio} also comprises a function for applying the allelic TDT \citep{spielman1993} to genotype data. Moreover, \texttt{trio} provides functionalities relevant for the analysis of case-parent trio data with {\it trio logic regression} \citep{li2010}. These are, on the one hand, functions that aid in the transformation of the trio data from standard linkage files (ped format) or genotype format into objects suitable as input for trio logic regression, and on the other hand, functions for applying trio logic regression and a bagging version of trio logic regression to these objects. Finally, \texttt{trio} provides functions for reading in and manipulating the case-parent trio data, estimating LD as well as LD-blocks, and simulating case-parent trio data, where the risk of disease is specified by (higher order) SNP interactions. In Section \ref{read} of this vignette, it is shown how family-based data stored in a linkage/ped file can be read into \texttt{R} and transformed into a format suitable for the application of the functions for performing the allelic and genotypic TDTs as well as the score tests. While Section \ref{testing} contains examples for the application of the gTDT functions to individual SNPs, two-way SNP interactions, and gene-environment interactions, Sections \ref{allelicTDT} and \ref{score} briefly show how the allelic TDT and the score tests related to the genotypic TDTs, respectively, can be used to test these factors. Section \ref{trio} is devoted to the steps relevant for data processing necessary to generate a matrix suitable as input for trio logic regression, starting from a linkage or genotype file, possibly containing missing data and/or Mendelian errors. We give some examples how missing data can be addressed using haplotype-based imputation. The haplotype information can be specified by the user, or when this information is not readily available, automatically inferred. The haplotype blocks are also relevant in the delineation of the genotypes for the pseudo-controls, as the linkage disequilibrium (LD) structure observed in the parents is taken into account in this process. While this function is intended to generate complete case-pseudo-control data as input for trio logic regression, an option to simply return the completed trio data is also available. In Sections \ref{triolr} and \ref{triofs}, it is shown how trio logic regression and trioFS \citep[trio Feature Selection;][]{schwender2011}, in which bagging with base learner trio logic regression is used to stabilize the search for relevant SNP interactions, can be applied to the data generated as described in Section \ref{trio}. For the estimation of the haplotype structure that might be used in the functions described in Section \ref{trio}, the \texttt{R} package \texttt{trio} also contains functions for computing and plotting the pairwise LD values and for detecting LD blocks. In Section \ref{getLD}, it is described how the pairwise values of the LD measures $D^\prime$ and $r^2$ can be computed with the function \texttt{getLD}, and how the $D^\prime$ values can be employed to estimate haplotype blocks with the algorithm of \citet{gabriel2002}. Finally, Section \ref{simulation} of the vignette explains in more detail how to set up simulations of case-parent trio data, where the risk of disease is specified by SNP interactions. The most time-consuming step for these types of simulations is the generation of mating tables and the respective probabilities. The mating table information, however, can be stored, which allows for fast simulations when replicates of the case-parent trio data are generated. \section{Preparing Data for the Genotypic TDTs}\label{read} \subsection{Dataframe in ped format} Case-parent trio data are typically stored in a ped file. The first six columns in such a ped file, which is also referred to as linkage file, identify the family structure of the data, and the phenotype. It is assumed that only one phenotype variable (column 6) is used. The object \texttt{trio.ped1}, available in the \texttt{R} package, is an example of a data set in ped format. It contains information for 10 SNPs in 100 trios. Besides the variables providing information on the family structure and the phenotypes (columns 1--6), each SNP is encoded in two variables denoting the alleles. <<>>= library(trio) data(trio.data) str(trio.ped1) trio.ped1[1:10,1:12] @ \subsection{Reading a ped file into R} If not already available as data frame or matrix in the \texttt{R} workspace, trio data can be read into \texttt{R} using the function \texttt{read.pedfile}. If we, for example, assume that the working directory of the current \texttt{R} session contains a file called "pedfile.ped" (this file is actually not available in \texttt{trio}, we just assume that such a file exists in the working directory), then this file can be read into \texttt{R} by calling \begin{verbatim} > ped <- read.pedfile("pedfile.ped") \end{verbatim} If the arguments \texttt{coded} and \texttt{first.row} of \texttt{read.pedfile} are not specified by the user, \texttt{read.pedfile} automatically tries to figures out how the alleles in the ped file are coded, and whether the first row contains the SNP names (\texttt{first.row = FALSE}) or the data for the first subject (\texttt{first.row = TRUE}). In the former case, \texttt{read.pedfile} adds the SNP names (with extensions \texttt{.1} and \texttt{.2} to differ between the two alleles) to the respective columns of the read-in data frame. \subsection{Transforming a data frame in ped format to a matrix in genotype format} For the applications of the functions for performing gTDTs (see Section \ref{testing}), the trio data must be in a matrix in genotype format. In such a matrix, each columns represents a SNP, which is coded by the number of minor alleles, and each block of 3 consecutive rows contains the genotypes of the father, the mother, and their offspring (in this order) of one specific trio. Missing values are allowed in this matrix, and need to be coded by \texttt{NA}. This matrix can either be generated from a data frame in ped format by employing the function \texttt{ped2geno}, or more conveniently, by setting \texttt{p2g = TRUE} in \texttt{read.pedfile}. Thus, a matrix in genotype format might be obtained from the above ped file by calling \begin{verbatim} > geno <- read.pedfile("pedfile.ped", p2g=TRUE) \end{verbatim} The output of these functions just contains the matrix in genotype format, whereas \texttt{trio.check} described in Section \ref{trio} additionally contains information about Mendelian errors. Instead of checking for Mendelian errors in \texttt{ped2geno} or \texttt{read.pedfile}, such errors are removed SNP-wise in the functions for performing genotypic TDTs. If, for example, the data frame \texttt{trio.ped1} should be transformed into a matrix in genotype format, \texttt{ped2geno} can be applied to it. However, \texttt{ped2geno} requires unique personal IDs (second column of \texttt{trio.ped1}) such that we first have to combine the family ID and the personal ID (which would be automatically done by \texttt{read.pedfile}), and change the IDs of the fathers and mothers in columns 3 and 4 likewise. <<>>= trio.ped1[,2] <- paste(trio.ped1[,1], trio.ped1[,2], sep="_") ids <- trio.ped1[,3] != 0 trio.ped1[ids,3] <- paste(trio.ped1[ids,1], trio.ped1[ids,3], sep="_") trio.ped1[ids,4] <- paste(trio.ped1[ids,1], trio.ped1[ids,4], sep="_") trio.ped1[1:5, 1:4] @ Afterwards, \texttt{ped2geno} can be applied to \texttt{trio.ped1} <<>>= geno <- ped2geno(trio.ped1) geno[1:5,] @ The matrix \texttt{trio.gen1} is the genotype matrix corresponding to \texttt{trio.ped1}. So the genotypes in the output of \texttt{ped2geno} are identical to \texttt{trio.gen1} (except for that the first two columns of \texttt{trio.gen1} contain the family ID and the personal ID). <<>>= trio.gen1[1:5, 3:12] table(trio.gen1[,3:12] == geno) @ \section{Genotypic TDTs}\label{testing} \subsection{Testing a Single SNP with a Genotypic TDT} A single SNP or two-way interaction can be tested with a gTDT by employing the functions \texttt{tdt} and \texttt{tdtGxG}. If we, for example, would like to test the first SNP in the matrix \texttt{mat.test} available in the \texttt{R} package \texttt{trio}, then this could be done by calling <<>>= tdt(mat.test[,1]) @ In this case, a conditional logistic regression is fitted, and the output of \texttt{tdt} contains the parameter estimate \texttt{Coef} for the SNP in this model, the relative risk \texttt{RR}, the \texttt{Lower} and \texttt{Upper} bound of the 95\% confidence interval of this relative risk, the standard error \texttt{SE} of the parameter estimate, the Wald \texttt{Statistic} for testing whether this SNP has an effect, and the corresponding \texttt{p-Value}. Note that in the case of trio data, \texttt{exp(Coef)} is an unbiased estimate for the relative risk, not for the odds ratio \citep{schaid1996}. By default, an additive effect is tested. It is, however, also possible to consider a dominant effect <<>>= tdt(mat.test[,1], model="dominant") @ or a recessive effect <<>>= tdt(mat.test[,1], model="recessive") @ \subsection{Testing a Single Interaction between two SNPs} Similarly the interaction between \texttt{SNP1} and \texttt{SNP2} in \texttt{mat.test} can be tested by <<>>= tdtGxG(mat.test[,1], mat.test[,2]) @ In this case, the interaction is tested for epistatic interactions as described in \citet{cordell2002} and \citet{cordell2004}. Thus, two conditional logistic regression models are fitted to the cases and the respective 15 matched pseudo-controls (i.e.\ the 15 possible, but not transmitted Mendelian genotype realizations, given the parents' genotypes at the two loci), one consisting of two coding variables for each of the two SNPs, and the other additionally containing the four possible interactions of these variables. The two fitted models are then compared by a likelihood ratio test, and the $p$-values are computed by approximation to a $\chi^2$-distribution with four degrees of freedom. Besides this likelihood ratio test (which is the default for the argument \texttt{test}), \texttt{tdtGxG} also provides the possibility to perform a likelihood ratio test comparing a conditional logistic regression model containing one parameter for each SNP and one parameter for the interaction of these two SNPs with a model only consisting of the two parameters for the main effects of the SNPs (\texttt{test = "lrt"}), where the genetic mode of inheritance assumed for the SNPs can be specified by the argument \texttt{mode} (by default, an additive mode is assumed). Furthermore, it is also possible to perform a Wald test for the interaction term either by considering a conditional logistic regression model either composed of one parameter for each SNP and one parameter for the interaction (\texttt{test = "full"}) or consisting of just one parameter for the interaction (\texttt{test = "screen"}). Thus, if the most simplest of these tests should be performed, then this could be done by <<>>= tdtGxG(mat.test[,1], mat.test[,2], test="screen") @ \subsection{Testing all SNPs in a Matrix in Genotype Format with a Genotypic TDT} All SNPs represented by the columns of a matrix in genotype format can be tested with a gTDT by employing the function \texttt{colTDT}. Thus, all SNPs in \texttt{mat.test} can be tested by calling <<>>= tdt.out <- colTDT(mat.test) tdt.out @ By default, the five top SNPs, i.e.\ the five SNPs with the lowest $p$-values, are shown ordered by their significance. The top three SNPs can be shown by <<>>= print(tdt.out, 3) @ If the integer specified in \texttt{print} is larger than or equal to the number of SNPs in the input matrix, the statistics for all SNPs are displayed in the order of their appearance in this matrix. <<>>= print(tdt.out, 10) @ \subsection{Performing a MAX Test} Since the genetic mode of inheritance is typically unknown, it might be beneficial to use the maximum over the gTDT statistics for an additive, a dominant, and a recessive effect as test statistic, which can be done using the function \texttt{colTDTmaxStat} <<>>= max.stat <- colTDTmaxStat(mat.test) max.stat @ This function just computes the MAX gTDT statistic, i.e.\ the maximum over the three gTDT statistics, since in contrast to these gTDT statistics, which under the null hypothesis follow an asymptotic $\chi^2_1$-distribution, the null distribution of the MAX gTDT statistic is unknown, and must therefore be estimated by a (time-consuming) permutation procedure. To also determine permutation-based p-values, \texttt{colTDTmaxTest} can be applied to a matrix in genotype matrix. For example, <<>>= max.out <- colTDTmaxTest(mat.test, perm=1000) @ computes p-values for the six SNPs in \texttt{mat.test} based on 1000 permutations of the case-pseudo-control status. <<>>= max.out @ \subsection{Testing all Pairs of SNPs in a Matrix in Genotype Format} All two-way interactions comprised a matrix in genotype format can be tested using the function \texttt{colGxG}. Since both the gTDT for two-way interactions and the likelihood ratio test of \citet{cordell2004} assume that the two considered loci are unlinked, the testing might fail, i.e.\ the fitting of the conditional logistic regression might not work pro\-per\-ly, if the two SNPs are in (strong) LD. There are several other reasons why this might happen. One of these reasons is that either the minor allele frequencies of both SNPs are very small or that the number of trios influencing the parameter estimation becomes very small when considering the two SNPs in combination. Therefore, \texttt{colGxG} provides an argument called \texttt{genes} that allows specifying which SNP belongs to which LD-block, gene, or genetic region. If \texttt{genes} is not specified, the interactions between all $m(m-1)/2$ pairs of the $m$ SNPs in a matrix are tested. If specified, only the interactions between SNPs showing different values of \texttt{genes} are tested. If we thus assume that the first two SNPs in \texttt{mat.test} belong to gene \texttt{G1} and the other four SNPs to \texttt{G2} <<>>= genes <- paste("G", rep(1:2, c(2,4)), sep="") genes @ then only the four interactions between \texttt{SNP1} and each SNP from gene \texttt{G2}, as well as the four interactions between \texttt{SNP2} and each SNP from gene \texttt{G2} are tested, when calling <<>>= tdt2.out <- colGxG(mat.test, genes=genes) tdt2.out @ Again, by default the top five SNP interactions are shown. The statistics for all eight interactions can be displayed by calling <<>>= print(tdt2.out, 8) @ \subsection{Testing Gene-Environment Interactions with a Genotypic TDT} In genetic association studies, it is often also of interest to test gene-environment interactions, where most of the usually considered environmental variables are binary. The \texttt{R} package \texttt{trio} therefore also provides a function called \texttt{colGxE} to test the interactions between each of the SNPs comprised by a matrix in genotype format and a binary environmental variable with values zero and one. If we, for example, assume that the children in the first 50 trios comprised by (the first 150 rows of) \texttt{mat.test} are girls, and the remaining 50 are boys, <<>>= sex <- rep(0:1, e=50) @ then we can test the interactions between the six SNPs in \texttt{mat.test} and the environmental variable ``sex" by <<>>= gxe.out <- colGxE(mat.test, sex) gxe.out @ In this situation, a conditional logistic regression model $\beta_\text{G} G + \beta_\text{GxE} (G\times E)$ is fitted for each SNP, where $G$ is a variable coding for an additive effect of the SNP, and $G\times E$ is the corresponding gene-environment interaction. Analogously to the other gTDT functions, a dominant or a recessive effect can also be considered by changing the argument \texttt{model} of \texttt{colGxE}. For both $\beta_\text{G}$ and $\beta_\text{GxE}$, the same as statistics as, for example, in \texttt{colTDT}, are computed. Additionally, the relative risks and their confidence intervals for the exposed trios are determined (note that the relative risks for the unexposed trios are given by $\exp\left(\beta_\text{G}\right)$), and a 2 degree of freedom Wald test for testing both $\beta_\text{G}$ and $\beta_\text{GxE}$ simultaneously as well as two likelihood ratio tests are performed, where the 2 df likelihood ratio test compares the likelihood of the full model (containing both $\beta_\text{G}$ and $\beta_\text{GxE}$) with the likelihood of a null model containing no variable, and the 1 df likelihood ratio test compares the likelihoods of the full model and a model only consisting of the SNP. The computation of all these statistics can be avoided by setting \texttt{addGandE = FALSE} (for the relative risk of the exposed cases), \texttt{add2df = FALSE} (for the 2 df Wald test) and \texttt{whichLRT = "none"} for the two likelihood ratio tests. If these statistics should be determined, but only the results for the genotypic TDT for testing the gene-environment interaction should be printed, then this can be done by calling <<>>= print(gxe.out, onlyGxE=TRUE) @ A convenient way to generate a data frame containing all the statistics determined by \texttt{colGxE} is to use the function \texttt{getGxEstats}. This function can be employed to obtain these results for all considered SNPs, and it can also be used to get these statistics only for the top SNPs. If, for example, the results of the three SNPs showing the smallest p-values for the 2 df likelihood ratio test are of interest, then the data frame containing these SNPs can be generated by <<>>= dat.top3 <- getGxEstats(gxe.out, top=3, sortBy="lrt2df") dat.top3 @ \section{Allelic TDT}\label{allelicTDT} Besides functions for performing genotypic TDTs, \texttt{trio} also provides functions for applying an allelic TDT and score tests related to the genotypic TDTs to matrices in genotype format. For example, the case-parent trio data in \texttt{mat.test} can be analyzed with an allelic TDT by <<>>= a.out <- allelicTDT(mat.test) a.out @ By default, McNemar's test statistic \begin{equation*} a=\dfrac{(b-c)^2}{b+c} \end{equation*} is used as test statistic in \texttt{allelicTDT}, where $b$ and $c$ are the off-diagonal elements of the 2x2-table summarizing the transmitted and untransmitted alleles from heterozygous parents, i.e.\ $b$ is the number of heterozygous parents that transmitted the minor allele to their respective children, and $c$ the number of heterozygous parents that transmitted the major allele. Alternatively, a version of McNemar's test statistic corrected for continuity, namely \begin{equation*} a_\text{Cor}= \dfrac{(|b-c|-1)^2}{b+c} \end{equation*} can be used by calling <<>>= a.out2 <- allelicTDT(mat.test, correct=TRUE) a.out2 @ \section{Score Tests}\label{score} If a score test instead of a Wald test, i.e.\ a genotypic TDT, should be considered in the analysis of genotype data, then \texttt{scoreTDT}, \texttt{scoreMaxStat}, \texttt{scoreGxG}, and \texttt{scoreGxE} can be used instead of and in the same way as \texttt{colTDT}, \texttt{colTDTmaxStat}, \texttt{colGxG}, and \texttt{colGxE}, respectively. For example, the SNPs in \texttt{mat.test} can be tested individually under the assumption of an additive mode of inheritance by <<>>= s.out <- scoreTDT(mat.test) s.out @ Score tests, however, only provide scores and p-values, but do not the allow the computation of odds ratios, relative risks, and confidence intervals. Usually, score tests have the advantage that their test statistic is much faster too compute than the statistic of a Wald test. However, when considering SNPs or interactions between SNPs and binary environmental factors, both the score tests and the genotypic TDTs have about the same computation time, as in these situations, there also exist analytic solutions for the genotypic TDT statistics \citep[see][]{schwender2012}. \section{Generating Data for Trio Logic Regression Input}\label{trio} If interactions of a higher order than two are of interest, trio logic regression can be used to detect disease-associated SNP interactions of any order. To generate data that can be used as input in trio logic regression, the sequential application of two functions is required. The function \texttt{trio.check} evaluates whether or not Mendelian errors are present in the data (stored either in linkage or in genotype format, see Section \ref{supported}). If no Mendelian inconsistencies are detected, this function creates an object that is passed to the function \texttt{trio.prepare}. The latter function then generates a matrix of the genotype information for the affected probands and the inferred pseudo-controls, taking the observed LD structure into account. Missing data are imputed in the process. The user, however, has to supply the information for the lengths of the LD blocks. A function called \texttt{findLDblocks} for identifying LD blocks, and thus, for specifying the length of the blocks is therefore also contained in this package (see Section \ref{getLD}). Given the lengths of the LD blocks, the haplotype frequencies can be estimated, using the function \texttt{haplo.em} in the \texttt{haplo.stat} package. \subsection{Supported File Formats and Elementary Data Processing}\label{supported} In this section, we show how to generate data suitable for input to trio logic regression from complete pedigree data without Mendelian errors. The function \texttt{trio.check} requires that the trio data are already available as a data frame or matrix, either in linkage/ped format (the default), or in genotype format (for reading a ped file into \texttt{R}, see Section \ref{testing}). The first function used is always \texttt{trio.check}. Unless otherwise specified, this function assumes that the data are in linkage format. If no Mendelian inconsistencies in the data provided are identified, \texttt{trio.check} creates an object that can be processed in the subsequent analysis with this package. The genotype information for each SNP will be converted into a single variable, denoting the number of variant alleles. If we thus would like to check whether the data frame \texttt{trio.ped1} contains Mendelian errors, we call <<>>= trio.tmp <- trio.check(dat=trio.ped1) str(trio.tmp, max=1) trio.tmp$trio[1:6,] @ Taking the LD structure of the SNPs into account is imperative when creating the genotypes for the pseudo-controls. This requires information on the LD blocks. However, there are many ways to delineate this block structure, and in the absence of a consensus what the best approach is, researchers have different preferences, and thus, results can be different. %It is therefore assumed that the user has already %delineated the block structure according to his or her method of %choice (assumed to be 1, 4, 2, and 3 in the following examples). In the function \texttt{findLDblocks}, a modified version of the method of \cite{gabriel2002} has been implemented, which can be used to specify the block structure by \begin{verbatim} > table(foundBlocks$blocks} \end{verbatim} if \texttt{foundBlocks} is the output of \texttt{findLDblocks} (for details, see Section \ref{getLD}). The function \texttt{trio.prepare}, which operates on an output object of \text{trio.check}, accepts the block length information as an argument (in the following, we assume that the block structure is given by \texttt{c(1, 4, 2, 3)}, i.e.\ the first block consists only of the first SNP, the second block of the next four SNPs, the third of the following two SNPs, and the last block of the remaining three SNPs). If this argument is not specified, a uniform block length of 1 (i.e.\ no LD structure) is assumed. If the haplotype frequencies are not specified, they are estimated from the parents' genotypes (more information on this in the following sections). The function \texttt{trio.prepare} then returns a list that contains the genotype information in binary format, suitable as input for trio logic regression: \texttt{bin} is a matrix with the conditional logistic regression response in the first columns, and each SNP as two binary variables using dominant and recessive coding. The list element \texttt{miss} contains information about missing values in the original data, and \texttt{freq} contains information on the estimated haplotype frequencies. To make the matrix generated by \texttt{trio.prepare} reproducible, the function \texttt{set.seed} is used to set the random number generator in a reproducible state. <<>>= set.seed(123456) trio.bin <- trio.prepare(trio.dat=trio.tmp, blocks=c(1,4,2,3)) str(trio.bin, max=1) trio.bin$bin[1:8,] @ As mentioned above, the \texttt{trio} package also accommodates trio genotype data. The object \texttt{trio.gen1}, available in the \texttt{R} package, is an example of such a data set. Equivalent to \texttt{trio.ped1} used above, it contains information for 10 SNPs in 100 trios. When used in \texttt{trio.check}, the argument \texttt{is.linkage} needs to be set to \texttt{FALSE}. The output from this function is then identical to the one shown derived from the linkage file, and can be passed to the function \texttt{trio.prepare}. <<>>= str(trio.gen1) trio.gen1[1:10,1:12] trio.tmp <- trio.check(dat=trio.gen1, is.linkage=FALSE) set.seed(123456) trio.bin2 <- trio.prepare(trio.dat=trio.tmp, blocks=c(1,4,2,3)) trio.bin2$bin[1:8,] @ \subsection{Missing Genotype Information}\label{missing} Missing genotypes in ped(igree) files are typically encoded using the integer 0. The data files can be processed as before if they contain such missing values: <<>>= str(trio.ped2) trio.tmp <- trio.check(dat=trio.ped2) trio.tmp$trio[1:6,] @ Since trio logic regression requires complete data, the function \texttt{trio.prepare} also performs an imputation of the missing genotypes. The imputation is based on estimated haplotypes, using the block length information specified by the user. In a later section we demonstrate how this imputation can be run more efficiently when haplotype frequency estimates are already available. <<>>= set.seed(123456) trio.bin3 <- trio.prepare(trio.dat=trio.tmp, blocks=c(1,4,2,3)) trio.bin3$bin[1:8,] @ Missing data in genotypes files should be encoded using \texttt{NA}, the conventional symbol in \texttt{R} to indicate missing values. <<>>= str(trio.gen2) trio.tmp <- trio.check(dat=trio.gen2, is.linkage=FALSE) set.seed(123456) trio.bin4 <- trio.prepare(trio.dat=trio.tmp, blocks=c(1,4,2,3)) trio.bin4$bin[1:8,] @ As the user might also be interested in the completed genotype data in the original format (genotype or linkage file), the function \texttt{trio.prepare} also allows for this option by using the argument \texttt{logic = FALSE}. In the resulting object, the matrix \texttt{bin} is then replaced by the data frame \texttt{trio}, and \texttt{miss} and \texttt{freq} are also returned. <<>>= trio.tmp <- trio.check(dat=trio.gen2, is.linkage=FALSE) set.seed(123456) trio.imp <- trio.prepare(trio.dat=trio.tmp, blocks=c(1,4,2,3), logic=FALSE) str(trio.imp, max=1) trio.imp$miss[c(1:6),] trio.gen2[1:6,] trio.imp$trio[1:6,] @ The same applies to pedigree data: <<>>= trio.tmp <- trio.check(dat=trio.ped2) set.seed(123456) trio.imp2 <- trio.prepare(trio.dat=trio.tmp, blocks=c(1,4,2,3), logic=FALSE) trio.imp$trio[1:6,] @ \subsection{Mendelian Errors}\label{mendelian} To delineate the genotype information for the pseudo-controls, the trio data must not contain any Mendelian errors. The function \texttt{trio.check} returns a warning, and an \texttt{R} object with relevant information when Mendelian errors are encountered is created. <<>>= trio.tmp <- trio.check(dat=trio.ped.err) str(trio.tmp, max=1) trio.tmp$errors @ In this data set, trio 1, for example, contains two Mendelian errors, in SNPs 9 and 10. <<>>= trio.tmp$trio.err[1:3, c(1,2, 11:12)] trio.ped.err[1:3,c(1:2, 23:26)] @ It is the user's responsibility to find the cause for the Mendelian errors and correct those, if possible. However, Mendelian inconsistencies are often due to genotyping errors and thus, it might not be possible to correct those in a very straight-forward manner. In this instance, the user might want to encode the genotypes that cause theses Mendelian errors in some of the trios as missing data. The argument \texttt{replace = TRUE} in \texttt{trio.check} allows for this possibility. The resulting missing data can then be imputed as described in the previous section. <<>>= trio.rep <- trio.check(dat=trio.ped.err, replace=TRUE) str(trio.rep, max=1) trio.rep$trio[1:3,11:12] @ The same option is available for data in genotype format with Mendelian inconsistencies. <<>>= trio.tmp <- trio.check(dat=trio.gen.err, is.linkage=FALSE) trio.tmp$errors trio.tmp$trio.err[1:6, c(1,2,7), drop=F] trio.rep <- trio.check(dat=trio.gen.err, is.linkage=FALSE, replace=TRUE) trio.rep$trio[1:6,c(1,2,7)] @ \subsection{Using Haplotype Frequencies} As mentioned above, when estimates for the haplotype frequencies are already available, they can be used in the imputation of missing data and the delineation of the pseudo-controls. In case there are blocks of length one, i.e.\ SNPs not belonging to any LD blocks, the minor allele frequencies of those SNPs are supplied. In this case, no haplotype estimation is required when the function \texttt{trio.prepare} is run, which can result in substantial time savings. As an example for the format of a file containing haplotype frequency estimates and SNP minor allele frequencies, the object \texttt{freq.hap} is available in the \texttt{R} package: <<>>= str(freq.hap) freq.hap[1:6,] @ We can now impute the missing genotypes using these underlying haplotype frequencies. <<>>= trio.tmp <- trio.check(dat=trio.gen2, is.linkage=FALSE) set.seed(123456) trio.imp3 <- trio.prepare(trio.dat=trio.tmp, freq=freq.hap, logic=FALSE) str(trio.imp3, max=1) trio.gen2[1:6,] trio.imp3$trio[1:6,] @ \section{Trio Logic Regression}\label{triolr} After having prepared a matrix suitable for a trio logic regression analysis, the function \texttt{trioLR} can be used to perform this analysis. \subsection{Parameter Settings for Trio Logic Regression} To ensure that the following examples are fast to run, we here only use 1000 iterations, i.e.\ the minimum number of iterations allowed, in the stochastic search algorithms, simulated annealing and MCMC, employed in trio logic regression. \textit{\bf In a real trio logic regression analysis, at least a few hundred thousands of iterations should be used, where the number of iterations in particular depends on the number of variables considered in the analysis.} The parameter \texttt{iter} for setting the number of iterations along with several other control parameters for the two stochastic search algorithms and the logic tree, i.e.\ the logic expression, grown in a trio logic regression can be specified by employing the argument \texttt{control} of the function \texttt{trioLR}, which in turn should be specified via the function \texttt{lrControl}. Here, we set <<>>= my.control <- lrControl(start=1, end=-3, iter=1000, output=-4) @ where \texttt{start} and \texttt{end} are the starting and end temperature on log10-scale in simulated annealing, where the temperature governs how likely it is that the model proposed in an iteration of trio logic regression is accepted although it actually performs worse than the model accepted in the previous iteration (at the beginning, this acceptance probability is comparatively high so that many models are visited/accepted, whereas in the last iterations, this probability is very low so that virtually only models are accepted if they are better than the current trio logic regression model). Again, it is very hard to give a good recommendation how to choose \texttt{start} and \texttt{end} (if they are not specified, the algorithm tries to find good choices for these two parameters, which, however, might works suboptimal), as their optimal specification highly depends on the data at hand. In our experience, \texttt{start = 1} and \texttt{end = -3} are often reasonable choices for a trio logic regression, but for a particular case-parent trio data set other choices might work better. The help page for the function \texttt{logreg.anneal.control} in the \texttt{R} package \texttt{LogicReg} gives a comprehensive introduction on how the control parameters of simulated annealing can be chosen. Finally, we have set \texttt{output = -4}, so that the models visited during the search for the best trio logic regression model with an MCMC algorithm are not stored in file, but all statistics for individual SNPs, pairs of SNPs, and triplets of SNPs are computed. \subsection{Performing a Trio Logic Regression Analysis} Employing this specification of these four parameters and the defaults for the other control parameters, \texttt{trioLR} can be directly applied to the output of \texttt{trio.prepare}, i.e.\ for example \texttt{trio.bin} (see Section \ref{missing}), by <<>>= lr.out <- trioLR(trio.bin, control=my.control, rand=9876543) lr.out @ where we specify the argument \texttt{rand} to set the random number generator in a reproducible state. By default, the logic expression found by trio logic regression is printed as it has been found. Alternatively, it can also be printed in disjunctive normal form <<>>= print(lr.out, asDNF=TRUE) @ This has the advantage that the interactions -- in particular, in a statistical sense -- contained in the logic expression are directly given by the AND-combinations (\&) in this disjunctive normal form. In the example, this advantage admittedly does not really exist, but when a much larger and more complex logic tree is grown, the logic expression represented by this logic tree can be very hard to interpret. It is also possible to construct the matrix \texttt{x} containing the values of the cases and the matched pseudo-controls and the vector \texttt{y} comprising the class labels of the (artificial) subjects without using \texttt{trio.prepare}. In this situation, each column of \texttt{x} must represent one of the logic/binary variable, coded by zeros and ones, and each row must correspond to a case or a pseudo-control, where each block of four consecutive rows has to consist of the data for a case and the three matched pseudo-controls (in this order). In \texttt{y}, each case must be coded by a \texttt{3} and each pseudo-control by a \texttt{0} so that \texttt{y} is given by <<>>= n.trios <- 100 y <- rep(c(3, 0, 0, 0), n.trios) @ where \texttt{n.trios} is the number of case-parent trios considered in a study. This number is here set to \texttt{100}, as our example data set \texttt{trio.bin} consists of 100 trios. To avoid the construction of the matrix \texttt{x}, we extract this matrix for this example from \texttt{trio.bin}. <<>>= x <- trio.bin$bin[,-1] @ The same trio regression analysis as above can then be performed by <<>>= lr.out2 <- trioLR(x, y, control=my.control, rand=9876543) lr.out2 @ \subsection{Permutation Tests for the Trio Logic Regression Model} The trio logic regression model resulting from such an application can then be tested by a null-model permutation procedure, which checks whether there is signal in the data, or by a conditional permutation test for model selection. Both can be performed using the function \texttt{trio.permTest} in which the argument \texttt{conditional} specifies which of these permutation tests are done. Thus, the null-model permutation test with \texttt{n.perm = 20} permutations can be applied to the case-parent trio data in \texttt{trio.bin} by \begin{verbatim} > trio.permTest(lr.out, n.perm=20) \end{verbatim} while the conditional permutation test can be performed by \begin{verbatim} > trio.permTest(lr.out, conditional=TRUE, n.perm=20) \end{verbatim} \subsection{Fitting Several Trio Logic Regression Models} By default, (trio) logic regression uses simulated annealing as search algorithm, and \texttt{trioLR} fits one trio logic regression model comprising one logic tree with a maximum of five leaves, i.e.\ five binary variables coded by zeros and ones Thus, the argument \texttt{nleaves} is by default set to \texttt{nleaves = 5}, whereas the number of logic trees cannot be changed in a trio logic regression. Alternatively, several trio logic regression models with different maximum numbers of leaves/variables can be fitted by setting \texttt{nleaves} to a vector of length 2, where the first element specifies the lowest maximum number, and the second element the largest maximum number. If thus, for example, trio logic regression models should be fitted in which the maximum number of leaves varies between 3 and 5, then this can be done by \begin{figure}[!t] \centerline{\includegraphics[width=0.7\textwidth]{figure1lr}} \vspace*{-8pt} \caption{Logic tree built in a trio logic regression analysis of case-parent trio data in which a maximum of five leaves was allowed.}\label{fig:tree} \end{figure} <<>>= lr.out3 <- trioLR(trio.bin, nleaves=c(3,5), control=my.control, rand=9876543) lr.out3 @ \subsection{Plotting Trio Logic Regression Models} The resulting logic trees cannot only be printed, but also plotted. If, for example, the logic tree in the third trio logic regression model (i.e.\ the model with a maximum of five leaves) should be plotted, then this can be done by \begin{verbatim} plot(lr.out3, whichTree=3) \end{verbatim} (see Figure \ref{fig:tree}), where the argument \texttt{whichTree} needs only to be specified if several models have been fitted. \subsection{Greedy Search in Trio Logic Regression} Alternatively to simulated annealing, a greedy search can be employed in trio logic regression by changing the argument \texttt{search} to \texttt{"greedy"}. <<>>= lr.out4 <- trioLR(trio.bin, search="greedy", rand=9876543) lr.out4 @ \subsection{MC Trio Logic Regression} While trio logic regression based on simulated annealing or a greedy search tries to identify the logic expression, and thus, the trio logic regression model, that provides the best prediction for the disease risk, the main goal of trio MC logic regression, i.e.\ trio logic regression based on MCMC is the specification of the individual relevance of SNPs and the joint importance of pairs and triplets of SNPs for disease risk by counting how often the individual SNPs occur in the models visited (and accepted) during the MCMC search and how often pairs and triplets of SNPs occur jointly in these models \citep[cf.][]{kooperberg2005}. This trio MC logic regression can be performed by <<>>= lr.out5 <- trioLR(trio.bin, search="mcmc", control=my.control, rand=9876543) lr.out5 @ \begin{figure}[!t] \centerline{\includegraphics[width=0.7\textwidth]{figure2lr}} \vspace*{-8pt} \caption{Visualization of the results of a trio MC logic regression analysis for the individual SNPs.}\label{fig:mc} \end{figure} The results of this analysis can be visualized by plotting for each SNP the percentages of visited models in which this SNP occurs (\texttt{freqType = 1}; the default), and for each pair of SNPs in how many of the visited models this pair occurs jointly (\texttt{freqType = 2}) and the observed-to-expected ratio of being jointly in the models (\texttt{freqType = 3}). For example, in Figure \ref{fig:mc}, \begin{verbatim} plot(lr.out5, freqType=1, useNames=TRUE) \end{verbatim} the frequencies for the individual SNPs are displayed. \begin{figure}[!t] \centerline{\includegraphics[width=0.7\textwidth]{figure3lr}} \vspace*{-8pt} \caption{Values of the importance measure of trioFS for the five most important identified SNP interactions.}\label{fig:fs} \end{figure} \section{Analysis of Trio Data with trioFS}\label{triofs} Another way to quantify the importance of SNPs and SNP interactions for the prediction of the disease risk is to apply trio logic regression to several subsets of the case-parent trio data and then employ the resulting trio logic regression models to compute importance measures for the interactions found. This procedure called trioFS (trio Feature Selection) is implemented in the function \texttt{trioFS} that can be used in a similar way as \texttt{trioLR}. For example, trioFS can be applied to the case-parent trio data by <<>>= fs.out <- trioFS(trio.bin, B=5, control=my.control, rand=9876543) fs.out @ where we here consider only \texttt{B = 5} subsets of the data, while in a real analysis, at least \texttt{B = 20} subsets should be considered. By default, simulated annealing is used in this search. Alternatively, it is also possible to do a greedy search by setting the argument \texttt{fast} of \texttt{trioFS} to \texttt{TRUE}. The importances of the identified SNP interactions cannot only be printed, but also be plotted. \begin{verbatim} plot(fs.out) \end{verbatim} (see Figure \ref{fig:fs}). \section{Detection of LD Blocks}\label{getLD} For the estimation of the haplotype structure that might be used in the \texttt{R} function \texttt{trio.prepare}, this package also includes functions for the fast computation of the pairwise $D^\prime$ and $r^2$ values for hundreds or thousands of SNPs, and for the identification of LD blocks in these genotype data using a modified version of the algorithm proposed by \citet{gabriel2002}. For the latter, it is assumed that the SNPs are ordered by their position on the chromosomes. These functions are not restricted to trio data, but can also be applied to population-based data. The only argument of these functions specifically included for trio data is \texttt{parentsOnly}. If set to \texttt{TRUE}, only the genotypes of the parents are used in the determination of the pairwise values of the LD measures and the estimation of the LD blocks. Furthermore, each parent is only considered once so that parents with more than one offspring do not bias the estimations. If trio data is used as input, the functions assume that the matrix containing the SNP data is in genotype format. \subsection{Computing Values of LD Measures} Here, we consider a simulated matrix \texttt{LDdata} from a population-based study. Thus, all subjects are assumed to be unrelated. This matrix contains simulated genotype data for 10 LD blocks each consisting of 5 SNPs each typed on 500 subjects. The pairwise $D^\prime$ and $r^2$ values for the SNPs in this matrix can be computed by <<>>= ld.out <- getLD(LDdata, asMatrix=TRUE) @ where by the default these values are stored in vectors to save memory. If \texttt{asMatrix} is set to \texttt{TRUE}, the values will be stored in matrices. The pairwise LD values for the first 10 SNPs (rounded to the second digit) can be displayed by <<>>= round(ld.out$Dprime[1:10,1:10], 2) @ <<>>= round(ld.out$rSquare[1:10,1:10], 2) @ and the pairwise LD plot for all SNPs can be generated by \begin{figure}[!b] \centerline{\includegraphics[width=0.5\textwidth]{figure1ld}} \caption{Pairwise $r^2$ values for the SNPs from \texttt{LDdata}.}\label{fig:1} \end{figure} \begin{verbatim} > plot(ld.out) \end{verbatim} (see Figure \ref{fig:1}). This figure shows the $r^2$-values. The $D^\prime$ values can be plotted by \begin{verbatim} > plot(ld.out, "Dprime") \end{verbatim} (not shown). \subsection{Estimating LD Blocks} The LD blocks in genotype data can be identified using the modified algorithm of \citet{gabriel2002} by calling <<>>= blocks <- findLDblocks(LDdata) blocks @ Alternatively, the output of \texttt{getLD} can be used when \texttt{addVarN} has been set to \texttt{TRUE} in \texttt{getLD} to store additional information on the pairwise LD values. <<>>= ld.out2 <- getLD(LDdata, addVarN=TRUE) blocks2 <- findLDblocks(ld.out2) blocks2 @ \begin{figure}[!t] \centerline{\includegraphics[width=0.5\textwidth]{figure2ld}} \caption{LD blocks as found by the modified algorithm of \citet{gabriel2002}. The borders of the LD blocks are marked by red lines. The color for the LD between each pair of SNPs is defined by the three categories used by \citet{gabriel2002} to define the LD blocks.}\label{fig:2} \end{figure} The blocks can also be plotted by \begin{verbatim} > plot(blocks) \end{verbatim} (see Figure \ref{fig:2}). In this figure, the borders of the LD blocks are marked by red lines. By default, the three categories used by the algorithm of \citet{gabriel2002} to define the LD blocks are displayed. Since this algorithm is based on the $D^\prime$ values, it is also possible to show these values in the LD (block) plot. \begin{figure}[!t] \centerline{\includegraphics[width=0.5\textwidth]{figure3ld}} \vspace*{-8pt} \caption{LD blocks as found by the modified algorithm of \citet{gabriel2002}. The borders of the LD blocks are marked by red lines. The darker the field for each pair of SNPs, the larger is the $D^\prime$ value for the corresponding SNP pair.}\label{fig:3} \end{figure} \begin{verbatim} > plot(blocks, "Dprime") \end{verbatim} (see Figure \ref{fig:3}). As mentioned in Section \ref{trio}, the haplotype structure required by \texttt{trio.prepare} can be obtained by <<>>= hap <- as.vector(table(blocks$blocks)) hap @ \section{Simulation}\label{simulation} The function \texttt{trio.sim} simulates case-parents trio data when the disease risk of children is specified by (possibly higher-order) SNP interactions. The mating tables and the respective sampling probabilities depend on the haplotype frequencies (or SNP minor allele frequencies when the SNP does not belong to a block). This information is specified in the \texttt{freq} argument of the function \texttt{trio.sim}. The probability of disease is assumed to be described by the logistic term $\text{logit}(p) = \alpha + \beta \times \text{Interaction}$, where $\alpha = \text{logit (\texttt{prev})}=\log( \frac{\text{\texttt{prev}}}{1-\text{\texttt{prev}}} )$ and $\beta = \log(\text{\texttt{OR}})$. The arguments \texttt{interaction, prev} and \texttt{OR}, are specified in the function \texttt{trio.sim}. Generating the mating tables and the respective sampling probabilities, in particular for higher order interactions, can be very CPU and memory intensive. We show how this information, once it has been generated, can be used for future simulations, and thus, speed up the simulations dramatically. \subsection{A Basic Example} We use the built-in object \texttt{simuBkMap} in a basic example to show how to simulate case-parent trios when the disease risk depends on (possibly higher order) SNP interactions. This file contains haplotype frequency information on 15 blocks with a total of 45 loci. In this example, we specify that the children with two variant alleles on SNP1 and two variant alleles on SNP5 have a higher disease risk. We assume that \texttt{prev = 0.001} and \texttt{OR = 2} in the logistic model specifying disease risk, and simulate a single replicate of 20 trios total. <<>>= str(simuBkMap) simuBkMap[1:7,] sim <- trio.sim(freq=simuBkMap, interaction="1R and 5R", prev=.001, OR=2, n=20, rep=1) str(sim) sim[[1]][1:6, 1:12] @ \subsection{Using Estimated Haplotype Frequencies} In this example we estimate the haplotype frequencies in the built-in data set \texttt{trio.gen1}, which contains genotypes for 10 SNPs in 100 trios. These estimated frequencies are then used to simulate 20 trios for the above specified disease risk model. <<>>= trio.tmp <- trio.check(dat=trio.gen1, is.linkage=FALSE) trio.impu <- trio.prepare(trio.dat=trio.tmp, blocks=c(1,4,2,3), logic=TRUE) str(trio.impu, max=2) trio.impu$freq[1:7,] sim <- trio.sim(freq=trio.impu$freq, interaction="1R and 5R", prev=.001, OR=2, n=20, rep=1) str(sim) sim[[1]][1:6, ] @ As before, the object containing the haplotype frequency information can also be generated from external haplotype frequencies and SNP minor allele frequencies. In the following example we specify the haplotype frequencies, and generate two replicates of ten trios each. <<>>= sim <- trio.sim(freq=freq.hap, interaction="1R or 4D", prev=.001, OR=2, n=10, rep=2) str(sim) sim[[1]][1:6,] @ \subsection{Using Step-Stones} Generating the mating tables and the respective sampling probabilities necessary to simulate case-parent trios can be very time consuming for interaction models involving three or more SNPs. In simulation studies, many replicates of similar data are usually required, and generating these sampling probabilities in each instance would be a large and avoidable computational burden (CPU and memory). The sampling probabilities depend foremost on the interaction term and the underlying haplotype frequencies, and as long as these remain constant in the simulation study, the mating table information and the sampling probabilities can be ``recycled.'' This is done by storing the relevant information (denoted as ``step-stone'') as a binary \texttt{R} file in the working directory, and loading the binary file again in future simulations, speeding up the simulation process dramatically. It is even possible to change the parameters \texttt{prev} and \texttt{OR} in these additional simulations, as the sampling probabilities can be adjusted accordingly. In the following example, we first simulate case-parent trios using a three-SNP interaction risk model, and save the step-stone object. We then simulate additional trios with a different parameter \texttt{OR}, using the previously generated information. <<>>= sim <- trio.sim(freq=freq.hap, interaction="1R or (6R and 10D)", prev=.001, OR=2, n=10, rep=1) str(sim) sim[[1]][1:6,] @ <<>>= sim <- trio.sim(freq=freq.hap, interaction="1R or (6R and 10D)", prev=.001, OR=3, n=10, rep=1, step.save="step3way") str(sim, max=1) sim[[1]][1:6,] @ \section*{Acknowledgments} Financial support was provided by DFG grants SCHW 1508/1-1, SCHW 1508/2-1, and SCHW 1508/3-1, as well as NIH grants R01 DK061662 and R01 HL090577. %\clearpage %\bibliography{xbib_appFinal} %\bibliographystyle{mypapers} \begin{thebibliography}{4} \bibitem[Cordell(2002)Cordell]{cordell2002} Cordell, H.J. (2002). \newblock Epistasis: What it Means, what it Doesn't Mean, and Statistical Methods to Detect it in Humans. \newblock {\em Human Molecular Genetics}, {\bf 11}, 2463--2468. \bibitem[Cordell {\em et al.}(2004)Cordell, Barratt, and Clayton]{cordell2004} Cordell, H.J., Barratt, B.J., and Clayton, D.G. (2004). \newblock Case/Pseudocontrol Analysis in Genetic Association Studies: A Unified Framework for Detection of Genotype and Haplotype Associations, Gene-Gene and Gene-Environment Interactions, and Parent-of-Origin Effects. \newblock {\em Genetic Epidemiology}, {\bf 26}, 167--185. \bibitem[Gabriel {\em et al.}(2002)Gabriel, Schaffner, Nguyen, Moore, Roy, Blumenstiel, Higgins, {DeFelice}, Lochner, Faggart, {Liu-Cordero}, Rotimi, Adeyemo, Cooper, Ward, Lander, Daly, and Altshuler]{gabriel2002} Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., {DeFelice}, M., Lochner, A., Faggart, M., {Liu-Cordero}, S.N., Rotimi, C., Adeyemo, A., Cooper, R., Ward, R., Lander, E.S., Daly, M.J., and Altshuler, D. (2002). \newblock The Structure of Haplotype Blocks in the Human Genome. \newblock {\em Science}, {\bf 296}, 2225--2229. \bibitem[Kooperberg and Ruczinski(2005)Kooperberg and Ruczinski]{kooperberg2005} Kooperberg, C. and Ruczinski, I. (2005). \newblock Identifying Interacting SNPs Using Monte Carlo Logic Regression. \newblock {\em Genetic Epidemiology}, {\bf 28}, 157--170. \bibitem[Li {\em et al.}(2010)Li, Fallin, Louis, Lasseter, {McGrath}, Avramopoulos, Wolyniec, Valle, Liang, Pulver, and Ruczinski]{li2010} Li, Q., Fallin, M.D., Louis, T.A., Lasseter, V.K., {McGrath}, J.A., Avramopoulos, D., Wolyniec, P.S., Valle, D., Liang, K.Y., Pulver, A.E., and Ruczinski, I. (2010). \newblock Detection of {SNP-SNP} Interactions in Trios of Parents with Schizophrenic Children. \newblock {\em Genetic Epidemiology}, {\bf 34}, 396--406. \bibitem[Schaid(1996)Schaid]{schaid1996} Schaid, D.J. (1996). \newblock General Score Tests for Associations of Genetic Markers with Disease Using Cases and Their Parents. \newblock {\em Genetic Epidemiology}, {\bf 13}, 423--449. \bibitem[Schwender {\em et al.}(2011a)Schwender, Bowers, Fallin, and Ruczinski]{schwender2011} Schwender, H., Bowers, K., Fallin, M.D., and Ruczinski, I. (2011). \newblock Importance Measure for Epistatic Interactions in Case-Parent Trios. \newblock {\emph Annals of Human Genetics}, {\bf 75}, 122--132. \bibitem[Schwender {\em et al.}(2011b)Schwender, Taub, Beaty, Marazita, and Ruczinski]{schwender2012} \newblock Schwender, H., Taub, M.A., Beaty, T.H., Marazita, M.L., and Ruczinski, I. (2011). \newblock Rapid Testing of SNPs and Gene-Environment Interactions in Case-Parent Trio Data Based on Exact Analytic Parameter Estimation. \newblock {\emph Biometrics}. DOI: 10.1111/j.1541-0420.2011.01713.x. \bibitem[Spielman {\em et al.}(1993)Spielman, McGinnis, and Ewens]{spielman1993} Spielman, R.S., McGinnis, R.E., and Ewens, W.J. (1993). \newblock Trsnmmission Test for Linkage Disequilibrium: The Insulin Gene Region and Insulin-Dependent Diabetes Mellitus (IDDM). \newblock {\em American Journal of Human Genetics}, {\bf 52}, 506--516. \end{thebibliography} \end{document}