% \VignetteIndexEntry{gmapR} % \VignetteKeywords{gmap,gsnap} % \VignettePackage{gmapR} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage{Sweave} \textwidth=6.5in \textheight=8.5in % \parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \title{gmapR: Use the GMAP Suite of Tools in R} \author{Michael Lawrence, Cory Barr} \date{\today} \begin{document} \maketitle \tableofcontents \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The \Rpackage{gmapR} packages provides users with a way to access \software{GSNAP}, \software{bam\_tally}, and other utilities from the \software{GMAP} suite of tools within an R session. In this vignette, we briefly look at the \software{GMAP} suite of tools available through the \Rpackage{gmapR} package and work through an example. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{What is \software{GMAP}, \software{GSNAP}, and \software{bam\_tally}?} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The \software{GMAP} suite offers useful tools for the following: \begin{itemize} \item Genomic mapping: Given a cDNA, find where it best aligns to an entire genome \item Genomic alignment: Given a cDNA and a genomic segment, provide a nucleotide-level correspondence for the exons of the cDNA to the genomic segment \item Summarization via coverage plus reference and allele nucleotide polymorphism counts for an aligned set of sequencing reads over a given genomic location \end{itemize} \software{GMAP} (Genomic Mapping and Alignment Program) is particularly suited to relatively long mRNA and EST sequences such as those that are obtained from Roche 454 or Pacific Biosciences sequencing technologies. (At present, only \software{GSNAP} is available through the \Rpackage{gmapR}. \software{GMAP} integration is scheduled for the near future.) \software{GSNAP} (Genomic Short-read Nucleotide Alignment Program) also provides users with genomic mapping and alignment capabilities, but is optimized to handle issues that arise when dealing with the alignment of short reads generated from sequencing technologies such as those from Illumina/Solexa or ABI/SOLiD. \software{GSNAP} offers the following functionality, as mentioned in \href{http://bioinformatics.oxfordjournals.org/content/26/7/873.full}{Fast and SNP-tolerant detection of complex variants and splicing in short reads} by Thomas D. Wu and Serban Nacu: \begin{itemize} \item fast detection of complex variants and splicing in short reads, based on a successively constrained search process of merging and filtering position lists from a genomic index \item alignment of both single- and paired-end reads as short as 14 nt and of arbitrarily long length \item detection of short- and long-distance splicing, including interchromosomal splicing, in individual reads, using probabilistic models or a database of known splice sites \item SNP-tolerant alignment to a reference space of all possible combinations of major and minor alleles \item alignment of reads from bisulfite-treated DNA for the study of methylation state \end{itemize} \software{bam\_tally} provides users with the coverage as well as counts of both reference alleles, alternative alleles, and indels over a genomic region. For more detailed information on the \software{GMAP} suite of tools including a detailed explication of algorithmic specifics, see \url{http://research-pub.gene.com/gmap/}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Create a \Rclass{GmapGenome} Object} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To align reads with \software{GMAP} or \software{GSNAP}, or to use \software{bam\_tally}, you will need to either obtain or create a \Rclass{GmapGenome} object. \Rclass{GmapGenome} objects can be created from FASTA files or \Rclass{BSgenome} objects, as the following example demonstrates: <>= library(gmapR) if (!suppressWarnings(require(BSgenome.Dmelanogaster.UCSC.dm3))) { source("http://bioconductor.org/biocLite.R") biocLite("BSgenome.Dmelanogaster.UCSC.dm3") library(BSgenome.Dmelanogaster.UCSC.dm3) } gmapGenomePath <- file.path(getwd(), "flyGenome") gmapGenomeDirectory <- GmapGenomeDirectory(gmapGenomePath, create = TRUE) ##> gmapGenomeDirectory ##GmapGenomeDirectory object ##path: /reshpcfs/home/coryba/projects/gmapR2/testGenome gmapGenome <- GmapGenome(genome=Dmelanogaster, directory=gmapGenomeDirectory, name="dm3", create=TRUE, k = 12L) ##> gmapGenome ##GmapGenome object ##genome: dm3 ##directory: /reshpcfs/home/coryba/projects/gmapR2/testGenome @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Aligning with \software{GSNAP}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The \software{GSNAP} algorithm incorporates biological knowledge to provide accurate alignments, particularly for RNA-seq data. In this section, we will align reads from an RNA-seq experiment provided in a fastq file to a selected region of the human genome. In this example, we will align to the region of the human genome containing TP53 plus an additional one megabase on each side of this gene. First we need to obtain the desired region of interest from the genome: <>= library("org.Hs.eg.db") library("TxDb.Hsapiens.UCSC.hg19.knownGene") eg <- org.Hs.eg.db::org.Hs.egSYMBOL2EG[["TP53"]] txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene tx <- transcripts(txdb, filter = list(gene_id = eg)) roi <- range(tx) + 1e6 strand(roi) <- "*" @ Next we get the genetic sequence and use it to create a GmapGenome object. (Please note that the TP53\_demo GmapGenome object is used by many examples and tests in the gmapR and VariationTools packages. If the object has been created before, its subsequent creation will be instantaneous.) <>= library("BSgenome.Hsapiens.UCSC.hg19") library("gmapR") p53Seq <- getSeq(BSgenome.Hsapiens.UCSC.hg19::Hsapiens, roi, as.character = FALSE) names(p53Seq) <- "TP53" gmapGenome <- GmapGenome(genome = p53Seq, name = paste0("TP53_demo_", packageVersion("TxDb.Hsapiens.UCSC.hg19.knownGene")), create = TRUE, k = 12L) @ We add the known transcripts (splice sites) to the genome index: <>= exons <- gmapR:::subsetRegion(exonsBy(txdb), roi, "TP53") spliceSites(gmapGenome, "knownGene") <- exons @ The data package \software{LungCancerLines} contains fastqs of reads obtained from sequencing H1993 and H2073 cell lines. We will use these fastqs to demonstrate \software{GSNAP} alignment with \software{gmapR}. <>= library("LungCancerLines") fastqs <- LungCancerFastqFiles() @ \software{GSNAP} is highly configurable. Users create \Rclass{GsnapParam} objects to specify desired \software{GSNAP} behavior. <>= ##specify how GSNAP should behave using a GsnapParam object gsnapParam <- GsnapParam(genome=gmapGenome, unique_only=FALSE, suboptimal_levels=2L, npaths=10L, novelsplicing=TRUE, splicing="knownGene", indel_penalty=1L, distant_splice_penalty=1L, clip_overlap=TRUE) @ Now we are ready to align. <>= gsnapOutput <- gsnap(input_a=fastqs["H1993.first"], input_b=fastqs["H1993.last"], params=gsnapParam) ##gsnapOutput ##An object of class "GsnapOutput" ##Slot "path": ##[1] "/local/Rtmporwsvr" ## ##Slot "param": ##A GsnapParams object ##genome: dm3 (/reshpcfs/home/coryba/projects/gmapR2/testGenome) ##part: NULL ##batch: 2 ##max_mismatches: NULL ##suboptimal_levels: 0 ##snps: NULL ##mode: standard ##nthreads: 1 ##novelsplicing: FALSE ##splicing: NULL ##npaths: 10 ##quiet_if_excessive: FALSE ##nofails: FALSE ##split_output: TRUE ##extra: list() ## ##Slot "version": ## [1] NA NA NA NA NA NA NA NA NA NA NA NA ## @ The \Rcode{gsnapOutput} object will be of the class \Rclass{GsnapOutput}. It will provide access to the BAM files of alignments that \software{GSNAP} outputs along with other utilities. <>= ##> dir(path(gsnapOutput), full.names=TRUE, pattern="\\.bam$") ##[1] "/local/Rtmporwsvr/file1cbc73503e9.1.nomapping.bam" ##[2] "/local/Rtmporwsvr/file1cbc73503e9.1.unpaired_mult.bam" ##[3] "/local/Rtmporwsvr/file1cbc73503e9.1.unpaired_transloc.bam" ##[4] "/local/Rtmporwsvr/file1cbc73503e9.1.unpaired_uniq.bam" @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Using \Rmethod{bam\_tally}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Running the \Rmethod{bam\_tally} method will return a \Rclass{GRanges} of information per nucleotide. Below is an example demonstrating how to find variants in the TP53 gene of the human genome. See the documentation for the \Rmethod{bam\_tally} method for more details. \software{gmapR} provides access to a demo genome for examples. This genome encompasses the TP53 gene along with a 1-megabase flanking region on each side. <>= genome <- TP53Genome() @ The \software{LungCancerLines} R package contains a BAM file of reads aligned to the TP53 region of the human genome. We'll use this file to demonstrate the use of \software{bam\_tally} through the \software{gmapR} package. The resulting data structure will contain the needed information such as number of alternative alleles, quality scores, and read position for each allele. The call to \Rfunction{bam\_tally} returns an opaque pointer to a C-level data structure. We anticipate having multiple means of summarizing these data into R data structures. For now, there is one: \Rfunction{variantSummary}, which returns a \Rcode{VRanges} object describing putative genetic variants in the sample. <>= bam_file <- system.file("extdata/H1993.analyzed.bam", package="LungCancerLines", mustWork=TRUE) breaks <- c(0L, 15L, 60L, 75L) bqual <- 56L mapq <- 13L param <- BamTallyParam(genome, minimum_mapq = mapq, concordant_only = FALSE, unique_only = FALSE, primary_only = FALSE, min_depth = 0L, variant_strand = 1L, ignore_query_Ns = TRUE, indels = FALSE, include_soft_clips = 1L, xs=TRUE) tallies <-bam_tally(bam_file, param) variantSummary(tallies, high_base_quality = 23L) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Creating a \Rclass{GmapGenome} Package} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% After creating a \Rclass{GmapGenome} object, you might want to distribute it for collaboration or version it for reproducible research. The function \Rcode{makeGmapGenomePackage} allows you to do this. Continuing on with the D. melanogaster example from above, here is how to archive the D. melanogaster \Rclass{GmapGenome} object in a \Rclass{GmapGenome} package: <>= makeGmapGenomePackage(gmapGenome=gmapGenome, version="1.0.0", maintainer="", author="Your Name", destDir="myDestDir", license="Artistic-2.0", pkgName="GmapGenome.Dmelanogaster.UCSC.dm3") @ After creating the package, you can run \Rcode{R CMD INSTALL myDestDir} to install it, or run \Rcode{R CMD build myDestDir} to create a distribution of the package. Many users with be working with the human genome. Many of the examples used in \software{gmapR} make use of a particular build of the human genome. As such, creating a \Rclass{GmapGenome} of hg19 is recommended. Here is one way to create it, using a \Rclass{BSgenome} object: <>= if (!suppressWarnings(require(BSgenome.Hsapiens.UCSC.hg19))) { source("http://bioconductor.org/biocLite.R") biocLite("BSgenome.Hsapiens.UCSC.hg19") library(BSgenome.Hsapiens.UCSC.hg19) } gmapGenome <- GmapGenome(genome=Hsapiens, directory = "Hsapiens", name = "hg19", create = TRUE) destDir <- "HsapiensGmapGenome" pkgName <- "GmapGenome.Hsapiens.UCSC.hg19" makeGmapGenomePackage(gmapGenome=gmapGenome, version="1.0.0", maintainer="", author="Your Name", destDir=destDir, license="Artistic-2.0", pkgName="GmapGenome.Hsapiens.UCSC.hg19") @ After running the above code, you should be able to run \Rcode{R CMD INSTALL GmapGenome.Hsapiens.UCSC.hg19} in the appropriate directory from the command line, submit GmapGenome.Hsapiens.UCSC.hg19 to a repository, etc. After installation, \Rcode{library("GmapGenome.Hsapiens.UCSC.hg19")} will load a \Rclass{GmapGenome} object that has the same name as the package. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\section{Aligning with SNP Tolerance} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %Both \software{GSNAP} and \software{GMAP} have the ability to perform %alignment without penalizing against a set of known SNPs. In this %section, we work through an example of aligning with this SNP %tolerance. By using SNP-tolerance, we'll perform alignments without %penalizing reads that match SNPs contained in the provided SNPs. In %this case, we'll use dbSNP to provide the SNPs. % %This example will use hg19. If you have not installed %\Rcode{GmapGenome.Hsapiens.UCSC.hg19}, please refer to the section %\ref{create_hg19_GmapGenome}. % %First we load the Hsapiens \Rclass{GmapGenome} object: % %<<>= %library("GmapGenome.Hsapiens.UCSC.hg19") %@ <>= sessionInfo() @ \end{document}