This section gives an overview of the operations that can be performed on
a given set of metadata obtained particularly from data-rich objects such
as those obtained from curatedTCGAData. There are several operations that
work with microRNA, methylation, mutation, and assays that have gene symbol
annotations.
 CpGtoRanges
Using the methylation annotations in
IlluminaHumanMethylation450kanno.ilmn12.hg19 and the minfi package, we
look up CpG probes and convert to genomic coordinates with CpGtoRanges.
The function provides two assays, one with mapped probes and the other with
unmapped probes. Excluding unmapped probes can be done by setting the
unmapped argument to FALSE. This will run for both types of methylation
data (27k and 450k).
methcoad <- CpGtoRanges(coad)
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
##   removing 535 sampleMap rows not in names(experiments)
 
 qreduceTCGA
The qreduceTCGA function converts RaggedExperiment mutation data objects
to RangedSummarizedExperiment using org.Hs.eg.db and the qreduceTCGA
utility function from RaggedExperiment to summarize ‘silent’ and ‘non-silent’
mutations based on a ‘Variant_Classification’ metadata column in the original
object.
It uses ‘hg19’ transcript database (‘TxDb’) package internally to summarize
regions using qreduceAssay. The current genome build (‘hg18’) in the data
must be translated to ‘hg19’.
In this example, we first set the appropriate build name in the mutation
dataset COAD_Mutation-20160128 according to the
NCBI website
and we then use seqlevelsStyle to match the UCSC style in the chain.
rag <- "COAD_Mutation-20160128"
# add the appropriate genome annotation
genome(coad[[rag]]) <- "NCBI36"
# change the style to UCSC
seqlevelsStyle(rowRanges(coad[[rag]])) <- "UCSC"
# inspect changes
seqlevels(rowRanges(coad[[rag]]))
##  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
## [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"
genome(coad[[rag]])
##   chr1   chr2   chr3   chr4   chr5   chr6   chr7   chr8   chr9  chr10  chr11 
## "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" 
##  chr12  chr13  chr14  chr15  chr16  chr17  chr18  chr19  chr20  chr21  chr22 
## "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" 
##   chrX   chrY 
## "hg18" "hg18"
Now we use liftOver from rtracklayer to translate ‘hg18’ builds
to ‘hg19’ using the downloaded chain file.
lifturl <-
"http://hgdownload.cse.ucsc.edu/goldenpath/hg18/liftOver/hg18ToHg19.over.chain.gz"
bfc <- BiocFileCache()
qfile <- bfcquery(bfc, "18to19chain", exact = TRUE)[["rpath"]]
cfile <-
if (length(qfile) && file.exists(qfile)) {
    bfcquery(bfc, "18to19chain", exact = TRUE)[["rpath"]]
} else {
    bfcadd(bfc, "18to19chain", lifturl)
}
chainfile <- file.path(tempdir(), gsub("\\.gz", "", basename(cfile)))
R.utils::gunzip(cfile, destname = chainfile, remove = FALSE)
chain <- suppressMessages(
    rtracklayer::import.chain(chainfile)
)
ranges19 <- rtracklayer::liftOver(rowRanges(coad[[rag]]), chain)
The same can be done to convert hg19 to hg38 (the same build that the
Genomic Data Commons uses) with the corresponding chain file:
liftchain <-
"http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz"
bfc <- BiocFileCache()
q38file <- bfcquery(bfc, "19to38chain", exact = TRUE)[["rpath"]]
c38file <-
if (length(q38file) && file.exists(q38file)) {
    bfcquery(bfc, "19to38chain", exact = TRUE)[["rpath"]]
} else {
    bfcadd(bfc, "19to38chain", liftchain)
}
cloc38 <- file.path(tempdir(), gsub("\\.gz", "", basename(c38file)))
R.utils::gunzip(c38file, destname = cloc38, remove = FALSE)
chain38 <- suppressMessages(
    rtracklayer::import.chain(cloc38)
)
## then use the liftOver function using the 'chain38' object
## as above
ranges38 <- rtracklayer::liftOver(unlist(ranges19), chain38)
This will give us a list of ranges, each element corresponding to a single row
in the RaggedExperiment. We remove rows that had no matches in the liftOver
process and replace the ranges in the original RaggedExperiment with the
replacement method. Finally, we put the RaggedExperiment object back into the
MultiAssayExperiment.
re19 <- coad[[rag]][as.logical(lengths(ranges19))]
ranges19 <- unlist(ranges19)
genome(ranges19) <- "hg19"
rowRanges(re19) <- ranges19
# replacement
coad[["COAD_Mutation-20160128"]] <- re19
rowRanges(re19)
## GRanges object with 62523 ranges and 0 metadata columns:
##           seqnames              ranges strand
##              <Rle>           <IRanges>  <Rle>
##       [1]    chr20     1552407-1552408      +
##       [2]     chr1 161736152-161736153      +
##       [3]     chr7           100685895      +
##       [4]     chr7           103824453      +
##       [5]     chr7           104783644      +
##       ...      ...                 ...    ...
##   [62519]     chr9            36369716      +
##   [62520]     chr9            37692640      +
##   [62521]     chr9             6007456      +
##   [62522]     chrX           123785782      +
##   [62523]     chrX            51487184      +
##   -------
##   seqinfo: 24 sequences from hg19 genome; no seqlengths
Now that we have matching builds, we can finally run the qreduceTCGA function.
coad <- qreduceTCGA(coad, keep.assay = TRUE)
## 
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
## 'select()' returned 1:1 mapping between keys and columns
## Warning in .normarg_seqlevelsStyle(value): more than one seqlevels style
## supplied, using the 1st one only
 
 symbolsToRanges
In the cases where row annotations indicate gene symbols, the symbolsToRanges
utility function converts genes to genomic ranges and replaces existing
assays with RangedSummarizedExperiment objects. Gene annotations are given
as ‘hg19’ genomic regions.
symbolsToRanges(coad)
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
## 'select()' returned 1:1 mapping between keys and columns
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
## 'select()' returned 1:1 mapping between keys and columns
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
##   removing 363 sampleMap rows not in names(experiments)
## A MultiAssayExperiment object of 11 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 11:
##  [1] COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 136 columns
##  [2] COAD_miRNASeqGene-20160128: SummarizedExperiment with 705 rows and 221 columns
##  [3] COAD_Mutation-20160128: RaggedExperiment with 62523 rows and 154 columns
##  [4] COAD_Methylation_methyl27-20160128: SummarizedExperiment with 27578 rows and 202 columns
##  [5] COAD_Methylation_methyl450-20160128: SummarizedExperiment with 485577 rows and 333 columns
##  [6] COAD_Mutation-20160128_simplified: RangedSummarizedExperiment with 22917 rows and 154 columns
##  [7] COAD_CNASeq-20160128_simplified: RangedSummarizedExperiment with 22917 rows and 136 columns
##  [8] COAD_mRNAArray-20160128_ranged: RangedSummarizedExperiment with 14222 rows and 172 columns
##  [9] COAD_mRNAArray-20160128_unranged: SummarizedExperiment with 3592 rows and 172 columns
##  [10] COAD_RNASeq2GeneNorm-20160128_ranged: RangedSummarizedExperiment with 17152 rows and 191 columns
##  [11] COAD_RNASeq2GeneNorm-20160128_unranged: SummarizedExperiment with 3349 rows and 191 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save all data to files