%\VignetteIndexEntry{parathyroidGenesSE} \documentclass{article} \usepackage{Sweave} \usepackage[a4paper]{geometry} \usepackage{hyperref,graphicx} \usepackage{cite} \usepackage{color} \usepackage{url} \usepackage{fancyvrb} \usepackage[utf8]{inputenc} \setlength{\parindent}{0em} \setlength{\parskip}{.5em} \SweaveOpts{keep.source=TRUE,eps=FALSE,height=4.5,width=4} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{{\small\texttt{#1}}} \newcommand{\fixme}[1]{{\textbf{Fixme:} \textit{\textcolor{blue}{#1}}}} \definecolor{darkgray}{gray}{0.2} \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontsize=\small,xleftmargin=1em,formatcom={\color{darkgray}}} \DefineVerbatimEnvironment{Soutput}{Verbatim}{fontsize=\small,xleftmargin=1em,frame=leftline,framerule=.6pt,rulecolor=\color{darkgray},framesep=1em,formatcom={\color{darkgray}}} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \title{\textsf{\textbf{Creation of \Robject{parathyroidGenesSE}}}} \author{Michael Love} \begin{document} \maketitle \begin{abstract} This vignette describes the construction of the SummarizedExperiment \Robject{parathyroidGenesSE} in the \Rpackage{parathyroidSE} package. \end{abstract} \tableofcontents <>= options(digits=3, width=80, prompt=" ", continue=" ") @ \section{Dataset description} We downloaded the RNA-Seq data from the publication of Haglund et al. \cite{Haglund2012Evidence}. The paired-end sequencing was performed on primary cultures from parathyroid tumors of 4 patients at 2 time points over 3 conditions (control, treatment with diarylpropionitrile (DPN) and treatment with 4-hydroxytamoxifen (OHT)). DPN is a selective estrogen receptor $\beta$ 1 agonist and OHT is a selective estrogen receptor modulator. One sample (patient 4, 24 hours, control) was omitted by the paper authors due to low quality. \section{Downloading the data} The raw sequencing data is publicly available from the NCBI Gene Expression Omnibus under accession number GSE37211\footnote{\url{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211}}. The read sequences in FASTQ format were extracted from the NCBI short read archive file (.sra files), using the sra toolkit\footnote{\url{http://www.ncbi.nlm.nih.gov/books/NBK56560/}}. \section{Aligning reads} The sequenced reads in the FASTQ files were aligned using TopHat version 2.0.4\footnote{\url{http://tophat.cbcb.umd.edu/}} with default parameters to the GRCh37 human reference genome using the Bowtie index available at the Illumina iGenomes page\footnote{\url{http://tophat.cbcb.umd.edu/igenomes.html}}. The following code for the command line produces a directory for each run and indexes the BAM file (substituting the SRR number for \texttt{file}): \begin{Verbatim}[frame=single] tophat2 -o file_tophat_out -p 8 genome file_1.fastq file_2.fastq samtools index file_tophat_out/accepted_hits.bam \end{Verbatim} \section{Counting reads in genes} The genes were downloaded using the \Rpackage{GenomicFeatures} package from Ensembl release 69 on 5 February 2013. The \Rfunction{exonsBy} function produces a \Rclass{GRangesList} object of all exons grouped by gene. <>= library(GenomicFeatures) hse <- makeTranscriptDbFromBiomart(biomart="ensembl", dataset="hsapiens_gene_ensembl") exonsByGene <- exonsBy(hse, by="gene") @ For the vignette, we load a subset of these genes: <>= library("parathyroidSE") data(exonsByGene) @ For counting reads in genes, we used \Rfunction{summarizeOverlaps} from the \Rpackage{GenomicRanges} and \Rpackage{Rsamtools} packages. The following code demonstrates counting reads from 3 reduced BAM files over a subset of the Ensembl genes. The protocol is not strand specific, so we set \texttt{ignore.strand=TRUE}. We counted ``singletons'' as well, reads with an unmapped mate, and added these counts to produce a total. \fixme{Note: we have two versions of the counting code, first for Bioconductor version 2.12 and then for following Bioconductor releases.} <>= library(Rsamtools) bamDir <- system.file("extdata",package="parathyroidSE",mustWork=TRUE) fls <- list.files(bamDir, pattern="bam$",full=TRUE) bamlst <- BamFileList(fls) if (BiocInstaller:::biocVersion() == "2.12") { geneHitsPairs <- summarizeOverlaps(exonsByGene, bamlst, mode="Union", singleEnd=FALSE, ignore.strand=TRUE) geneHitsSingletons <- summarizeOverlaps(exonsByGene, bamlst, mode="Union", param=ScanBamParam(flag=scanBamFlag( isPaired=TRUE, hasUnmappedMate=TRUE)), singleEnd=TRUE, ignore.strand=TRUE) } else { geneHitsPairs <- summarizeOverlaps(exonsByGene, bamlst, mode="Union", singleEnd=FALSE, fragments=FALSE, ignore.strand=TRUE) geneHitsSingletons <- summarizeOverlaps(exonsByGene, bamlst, mode="Union", param=ScanBamParam(flag=scanBamFlag( isPaired=TRUE, hasUnmappedMate=TRUE)), singleEnd=TRUE, fragments=FALSE, ignore.strand=TRUE) } parathyroidGenesSE <- geneHitsPairs assay(parathyroidGenesSE) <- assay(geneHitsPairs) + assay(geneHitsSingletons) @ \section{Preparing exonic parts} For counting reads at the exon-level, we first prepared a \Rclass{GRanges} object which contains non-overlapping exonic parts. By comparing count levels across these exonic parts, we could infer cases of differential exon usage. The resulting exonic parts are identical to those produced by the python script distributed with the \Rpackage{DEXSeq} package (though the aggregated gene names might be in a different order). Note that some of the exonic parts have changed since the preparation of the \Rpackage{parathyroid} package due to the different Ensembl releases. We first retrieved the exon-by-transcript information to annotate exonic parts with transcript membership. <>= exonsByTranscript <- exonsBy(hse, by="tx", use.names=TRUE) @ For the vignette, we import a subset of these transcripts: <>= data(exonsByTranscript) @ Disjoining the exons into non-overlapping exonic parts: <>= exonicParts <- disjoin(unlist(exonsByGene)) @ Assigning exonic parts to aggregate genes: <>= foGG <- findOverlaps(exonsByGene, exonsByGene) splitByGene <- split(subjectHits(foGG), queryHits(foGG)) aggregateGeneNames <- sapply(splitByGene, function(i) paste(names(exonsByGene)[i],collapse="+")) foEG <- findOverlaps(exonicParts, exonsByGene, select="first") mcols(exonicParts)$aggregate_gene <- aggregateGeneNames[foEG] @ Assigning exonic parts to transcripts: <>= foET <- findOverlaps(exonicParts, exonsByTranscript) splitByExonicPart <- split(subjectHits(foET), queryHits(foET)) mcols(exonicParts)$transcripts <- sapply(splitByExonicPart, function(i) paste(names(exonsByTranscript)[i],collapse="+")) @ Sorting the exonic parts, and assigning numbers to each exonic part per aggregate gene: <>= exonicParts <- exonicParts[order(mcols(exonicParts)$aggregate_gene)] mcols(exonicParts)$exonic_part_number <- do.call(c,lapply(split(mcols(exonicParts)$aggregate_gene, mcols(exonicParts)$aggregate_gene), function(z) seq(along=z))) @ The resulting exonic parts look like: <>= exonicParts[101:103] @ \section{Counting reads in exonic parts} We used the \Rfunction{countOverlaps} function as a counting mode, in order to count all overlaps. Otherwise, paired-end reads and junction-spanning reads which hit more than one exonic part would not be counted. \fixme{Note: we have two versions of the counting code, first for Bioconductor version 2.12 and then for following Bioconductor releases.} <>= if (BiocInstaller:::biocVersion() == "2.12") { myco <- function(reads, features, ignore.strand) countOverlaps( features, reads, ignore.strand=ignore.strand) exonHitsPairs <- summarizeOverlaps(exonicParts, bamlst, mode=myco, singleEnd=FALSE, ignore.strand=TRUE) exonHitsSingletons <- summarizeOverlaps(exonicParts, bamlst, mode=myco, param=ScanBamParam(flag=scanBamFlag( isPaired=TRUE, hasUnmappedMate=TRUE)), singleEnd=TRUE, ignore.strand=TRUE) } else { myco <- function(features, reads, ignore.strand, inter.feature) countOverlaps( features, reads, ignore.strand=ignore.strand) exonHitsPairs <- summarizeOverlaps(exonicParts, bamlst, mode=myco, singleEnd=FALSE, fragments=FALSE, ignore.strand=TRUE) exonHitsSingletons <- summarizeOverlaps(exonicParts, bamlst, mode=myco, param=ScanBamParam(flag=scanBamFlag( isPaired=TRUE, hasUnmappedMate=TRUE)), singleEnd=TRUE, fragments=FALSE, ignore.strand=TRUE) } parathyroidExonsSE <- exonHitsPairs assay(parathyroidExonsSE) <- assay(exonHitsPairs) + assay(exonHitsSingletons) @ \section{Obtaining sample annotations from GEO} In order to provide phenotypic data for the samples, we used the \Rpackage{GEOquery} package to parse the series matrix file downloaded from the NCBI Gene Expression Omnibus under accession number GSE37211. We included this file as well in the package, and read it in locally in the code below. <>= library("GEOquery") gse37211 <- getGEO(filename=system.file("extdata/GSE37211_series_matrix.txt", package="parathyroidSE",mustWork=TRUE)) samples <- pData(gse37211)[,c("characteristics_ch1","characteristics_ch1.2", "characteristics_ch1.3","relation")] colnames(samples) <- c("patient","treatment","time","experiment") samples$patient <- sub("patient: (.+)","\\1",samples$patient) samples$treatment <- sub("agent: (.+)","\\1",samples$treatment) samples$time <- sub("time: (.+)","\\1",samples$time) samples$experiment <- sub("SRA: http://www.ncbi.nlm.nih.gov/sra\\?term=(.+)","\\1", samples$experiment) samples @ \section{Matching GEO experiments with SRA runs} The sample information from GEO must be matched to the individual runs from the Short Read Archive (the FASTQ files), as some samples are spread over multiple sequencing runs. The run information can be obtained from the Short Read Archive using the \Rpackage{SRAdb} package (note that the first step involves a large download of the SRA metadata database). We included the conversion table in the package. <>= library("SRAdb") sqlfile <- getSRAdbFile() sra_con <- dbConnect(SQLite(),sqlfile) conversion <- sraConvert(in_acc = samples$experiment, out_type = c("sra","submission","study","sample","experiment","run"), sra_con = sra_con) write.table(conversion,file="inst/extdata/conversion.txt") @ We used the \Rfunction{merge} function to match the sample annotations to the run information. We ordered the \Rclass{data.frame} \Robject{samplesFull} by the run number and then set all columns as factors. <>= conversion <- read.table(system.file("extdata/conversion.txt", package="parathyroidSE",mustWork=TRUE)) samplesFull <- merge(samples, conversion) samplesFull <- samplesFull[order(samplesFull$run),] samplesFull <- DataFrame(lapply(samplesFull, factor)) @ \section{Adding column data and experiment data} We combined the information from GEO and SRA to the \Rclass{SummarizedExperiment} object. First we extracted the run ID, contained in the names of the \Rclass{BamFileList} in the \Robject{fileName} column. We then ordered the rows of \Robject{samplesFull} to match the order of the run ID in \Robject{parathyroidGenesSE}, and removed the duplicate column of run ID. <>= colData(parathyroidGenesSE)$run <- sub(".*(SRR.*)_tophat_out.*","\\1", names(colData(parathyroidGenesSE)$fileName)) matchOrder <- match(colData(parathyroidGenesSE)$run, samplesFull$run) colData(parathyroidGenesSE) <- cbind(colData(parathyroidGenesSE), subset(samplesFull[matchOrder,],select=-run)) colData(parathyroidExonsSE)$run <- sub(".*(SRR.*)_tophat_out.*","\\1", names(colData(parathyroidExonsSE)$fileName)) matchOrder <- match(colData(parathyroidExonsSE)$run, samplesFull$run) colData(parathyroidExonsSE) <- cbind(colData(parathyroidExonsSE), subset(samplesFull[matchOrder,],select=-run)) @ We included experiment data and PubMed ID from the NCBI Gene Expression Omnibus. <>= exptData = new("MIAME", name="Felix Haglund", lab="Science for Life Laboratory Stockholm", contact="Mikael Huss", title="DPN and Tamoxifen treatments of parathyroid adenoma cells", url="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211", abstract="Primary hyperparathyroidism (PHPT) is most frequently present in postmenopausal women. Although the involvement of estrogen has been suggested, current literature indicates that parathyroid tumors are estrogen receptor (ER) alpha negative. Objective: The aim of the study was to evaluate the expression of ERs and their putative function in parathyroid tumors. Design: A panel of 37 parathyroid tumors was analyzed for expression and promoter methylation of the ESR1 and ESR2 genes as well as expression of the ERalpha and ERbeta1/ERbeta2 proteins. Transcriptome changes in primary cultures of parathyroid adenoma cells after treatment with the selective ERbeta1 agonist diarylpropionitrile (DPN) and 4-hydroxytamoxifen were identified using next-generation RNA sequencing. Results: Immunohistochemistry revealed very low expression of ERalpha, whereas all informative tumors expressed ERbeta1 (n = 35) and ERbeta2 (n = 34). Decreased nuclear staining intensity and mosaic pattern of positive and negative nuclei of ERbeta1 were significantly associated with larger tumor size. Tumor ESR2 levels were significantly higher in female vs. male cases. In cultured cells, significantly increased numbers of genes with modified expression were detected after 48 h, compared to 24-h treatments with DPN or 4-hydroxytamoxifen, including the parathyroid-related genes CASR, VDR, JUN, CALR, and ORAI2. Bioinformatic analysis of transcriptome changes after DPN treatment revealed significant enrichment in gene sets coupled to ER activation, and a highly significant similarity to tumor cells undergoing apoptosis. Conclusions: Parathyroid tumors express ERbeta1 and ERbeta2. Transcriptional changes after ERbeta1 activation and correlation to clinical features point to a role of estrogen signaling in parathyroid function and disease.") pubMedIds(exptData) <- "23024189" exptData(parathyroidGenesSE) <- list(MIAME=exptData) exptData(parathyroidExonsSE) <- list(MIAME=exptData) @ Finally, we saved the object in the data directory of the package. <>= save(parathyroidGenesSE,file="data/parathyroidGenesSE.RData") save(parathyroidExonsSE,file="data/parathyroidExonsSE.RData") @ \section{Session information} <>= toLatex(sessionInfo()) @ \bibliography{library} \bibliographystyle{plain} \end{document}