Original Authors: Martin Morgan, Sonali Arora
Presenting Author: Martin Morgan
Date: 26 July, 2019
The goal of this section is to highlight practices for writing correct, understandable, robust and efficient R code.
identical(), all.equal())NA
values, …Profile
system.time() or the microbenchmark package.Rprof()
function, or packages such as lineprof or
aprofVectorize – operate on vectors, rather than explicit loops
x <- 1:10
log(x)     ## NOT for (i in seq_along) x[i] <- log(x[i])##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851Pre-allocate memory, then fill in the result
result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
       result[i] <- runif(1) * result[i - 1]
result##  [1] 0.2112226484 0.0837410720 0.0375219451 0.0087721750 0.0070520504
##  [6] 0.0052281030 0.0040075761 0.0020400718 0.0007054378 0.0002237723Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once
for looplm.fit() rather than repeatedly fitting
the same design matrix.Re-use existing, tested code
tabulate(),
rowSums() and friends, %in%, …Here’s an inefficient function:
f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result <- c(result, a * log(i))
    result
}Use system.time() to investigate how this algorithm scales with n,
focusing on elapsed time.
system.time(f0(10000))##    user  system elapsed 
##   0.168   0.012   0.180n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")Remember the current ‘correct’ value, and an approximate time
n <- 10000
system.time(expected <- f0(n))##    user  system elapsed 
##   0.208   0.012   0.218head(expected)## [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519Revise the function to hoist the common multiplier, a, out of the
loop. Make sure the result of the ‘optimization’ and the original
calculation are the same. Use the microbenchmark
package to compare the two versions
f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result <- c(result, log(i))
    a * result
}
identical(expected, f1(n))## [1] TRUElibrary(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval
##  f0(n) 115.2943 123.1257 125.0825 124.2671 128.2014 134.5239     5
##  f1(n) 116.1748 118.6345 120.6672 119.3346 124.0238 125.1682     5Adopt a ‘pre-allocate and fill’ strategy
f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))## [1] TRUEmicrobenchmark(f0(n), f2(n), times=5)## Unit: microseconds
##   expr        min         lq       mean     median         uq        max
##  f0(n) 113041.957 117630.568 138406.110 129078.581 151187.614 181091.832
##  f2(n)    934.442    938.751   1212.101   1144.617   1394.306   1648.391
##  neval
##      5
##      5Use an *apply() function to avoid having to explicitly pre-allocate,
and make opportunities for vectorization more apparent.
f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)
identical(expected, f3(n))## [1] TRUEmicrobenchmark(f0(n), f2(n), f3(n), times=10)## Unit: microseconds
##   expr        min         lq       mean      median         uq        max
##  f0(n) 114909.182 116023.724 119635.840 120344.3835 123610.775 123799.268
##  f2(n)    926.806    931.197   1117.099    956.8945   1316.960   1689.219
##  f3(n)   3330.438   3414.339   4259.659   3588.0740   5019.677   7848.792
##  neval
##     10
##     10
##     10Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:
f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))## [1] TRUEmicrobenchmark(f0(n), f3(n), f4(n), times=10)## Unit: microseconds
##   expr        min         lq        mean     median         uq        max
##  f0(n) 113447.054 118092.762 129562.6032 121978.758 134142.406 173583.208
##  f3(n)   3310.615   3353.244   4198.5312   3502.275   5476.352   5836.909
##  f4(n)    225.965    229.301    385.5061    235.880    253.538   1693.834
##  neval
##     10
##     10
##     10f4() definitely seems to be the winner. How does it scale with n?
(Repeat several times)
n <- 10 ^ (5:8)                         # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")Any explanations for the different pattern of response?
Lessons learned:
*apply() functions help avoid need for explicit pre-allocation
and make opportunities for vectorization more apparent. This may
come at a small performance cost, but is generally worth itWhen data are too large to fit in memory, we can iterate through files in chunks or subset the data by fields or genomic positions.
Iteration
open(), read chunk(s), close().yieldSize argument to Rsamtools::BamFile()GenomicFiles::reduceByYield()Restriction
Exploit domain-specific formats
Rsamtools::ScanBamParam()Rsamtools::PileupParam()VariantAnnotation::ScanVcfParam()Use a data base
Iterate through files: GenomicFiles::reduceByYield()
suppressPackageStartupMessages({
    library(GenomicFiles)
    library(GenomicAlignments)
    library(Rsamtools)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(RNAseqData.HNRNPC.bam.chr14)
})
yield <-     # how to input the next chunk of data
    function(X, ...)
{
    readGAlignments(X)
}
map <-       # what to do to each chunk
    function(VALUE, ..., roi)
{
    olaps <- findOverlaps(VALUE, roi, type="within", ignore.strand=TRUE)
    count <- tabulate(subjectHits(olaps), subjectLength(olaps))
    setNames(count, names(roi))
}
reduce <- `+`   # how to combine mapped chunksImprovement: “yield factory” to keep track of how many records input
yieldFactory <-  # return a function with local state 
    function() 
{
    n_records <- 0L
    function(X, ...) {
        aln <- readGAlignments(X)
        n_records <<- n_records + length(aln)
        message(n_records)
        aln
    }
}Regions of interest, named like the chromosomes in the bam file.
exByTx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")A function to iterate through a bam file.
count1 <- function(filename, roi) {
    message(filename)
    ## Create and open BAM file
    bf <- BamFile(filename, yieldSize=1000000)
    reduceByYield(bf, yieldFactory(), map, reduce, roi=roi)
}In action
## path to example bam files
library(RNAseqData.HNRNPC.bam.chr14)
bam <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
count <- count1(bam[[1]], exByTx)## /home/csama/R/x86_64-pc-linux-gnu-library/3.6/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127306_chr14.bam## 800484
## 800484hist(log10(count)) # drops 0's| Type | Example use | Name | Package | 
|---|---|---|---|
| .bed | Range annotations | BedFile() | rtracklayer | 
| .wig | Coverage | WigFile(),BigWigFile() | rtracklayer | 
| .gtf | Transcript models | GTFFile() | rtracklayer | 
| makeTxDbFromGFF() | GenomicFeatures | ||
| .2bit | Genomic Sequence | TwoBitFile() | rtracklayer | 
| .fastq | Reads & qualities | FastqFile() | ShortRead | 
| .bam | Aligned reads | BamFile() | Rsamtools | 
| .tbx | Indexed tab-delimited | TabixFile() | Rsamtools | 
| .vcf | Variant calls | VcfFile() | VariantAnnotation | 
## rtracklayer menagerie
suppressPackageStartupMessages(library(rtracklayer))
names(getClass("RTLFile")@subclasses)##  [1] "UCSCFile"         "GFFFile"          "BEDFile"         
##  [4] "WIGFile"          "BigWigFile"       "ChainFile"       
##  [7] "TwoBitFile"       "FastaFile"        "TabSeparatedFile"
## [10] "CompressedFile"   "GFF1File"         "GFF2File"        
## [13] "GFF3File"         "BEDGraphFile"     "BED15File"       
## [16] "BEDPEFile"        "NarrowPeakFile"   "BroadPeakFile"   
## [19] "BWFile"           "2BitFile"         "GTFFile"         
## [22] "GVFFile"          "GZFile"           "BGZFile"         
## [25] "BZ2File"          "XZFile"Notes
open(), close(), import() / yield() / read*().bai, bam index);
selection (‘columns’); restriction (‘rows’)*FileList() classes
reduceByYield() – iterate through a single large filebplapply() (BiocParallel) – perform independent
operations on several files, in parallelGenomicFiles()
reduceByRange(), reduceByFile(): collapse maps into summary
representationVcfStack()
?VcfStackA couple of caveats -
Iteration / restriction techniques keep the memory requirements under control while parallel evaluation distributes computational load across nodes. Keep in mind that parallel computations are still restricted by the amount of memory available on each node.
There is overhead in setting up and tearing down a cluster and more so when computing in distributed memory. For small calculations, the parallel overhead may outweigh the benefits with no improvement in performance.
Jobs that benefit the most from parallel execution are CPU-intensive and operate on data chunks that fits into memory.
BiocParallel provides a standardized interface for parallel evaluation and supports the major parallel computing styles: forks and processes on a single computer, ad hoc clusters, batch schedulers and cloud computing. By default, BiocParallel chooses a parallel back-end appropriate for the OS and is supported across Unix, Mac and Windows.
General ideas:
bplapply() instead of lapply()Argument BPPARAM influences how parallel evaluation occurs
MulticoreParam(): threads on a single (non-Windows) machineSnowParam(): processes on the same or different machinesBatchJobsParam(): resource scheduler on a clusterThis small example motivates the use of parallel execution and demonstrates how
bplapply() can be a drop in for lapply.
Use system.time() to explore how long this takes to execute as n
increases from 1 to 10. Use identical() and
microbenchmark to compare alternatives f0() and f1()
for both correctness and performance.
fun sleeps for 1 second, then returns i.
library(BiocParallel)
fun <- function(i) {
    Sys.sleep(1)
    i
}
## serial
f0 <- function(n)
    lapply(seq_len(n), fun)
## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)BPREDOBiocParallel “catches and returns” errors along with
successful results. This exercise demonstrates how to access the
traceback() of a failed task and how to re-run the failed tasks with
‘BPREDO’. Full details on error handling, logging and debugging are
in the Errors, Logs and Debugging vignette.
param <- MulticoreParam(workers = 3) # Windows: SnowParam(workers=3)Call the sqrt() function across ‘X’; the second element is a character
and will throw and error.
X <- list(1, "2", 3)
res <- bplapply(X, sqrt, BPPARAM = param)## Error: BiocParallel errors
##   element index: 2
##   first error: non-numeric argument to mathematical functionIt’s also possible to catch the error and partially evaluated result
res <- bptry(bplapply(X, sqrt, BPPARAM=param))
res## [[1]]
## [1] 1
## 
## [[2]]
## <remote_error in FUN(...): non-numeric argument to mathematical function>
## traceback() available as 'attr(x, "traceback")'
## 
## [[3]]
## [1] 1.732051Re-run the failed results by repeating the call to bplapply() this
time with corrected input data and the partial results as
‘BPREDO’. Only the failed values are re-run.
X.redo <- list(1, 2, 3)
bplapply(X.redo, sqrt, BPREDO = res)## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051Alternatively, switch to a SerialParam() and debug the specific
element that caused the error.
> fun = function(i) { browser(); sqrt(i) }
> bplapply(X, fun, BPREDO=res, BPPARAM=SerialParam())
resuming previous calculation ... 
Called from: FUN(...)
Browse[1]> 
debug at #1: sqrt(i)
Browse[2]> i
[1] "2"
Browse[2]> i = 2
Browse[2]> c
[[1]]
[1] 1
[[2]]
[1] 1.414214
[[3]]
[1] 1.732051BiocParallel uses the futile.logger package for logging. The package has a flexible system for filtering messages of varying severity thresholds such as INFO, DEBUG, ERROR etc. (For a list of all thresholds see the ?bpthreshold man page.) BiocParallel captures messages written in futile.logger format as well as messages written to stdout and stderr.
This function does some argument checking and has DEBUG, WARN and INFO-level log messages.
FUN <- function(i) {
  flog.debug(paste0("value of 'i': ", i))
  if (!length(i)) {
      flog.warn("'i' is missing")
      NA 
  } else if (!is(i, "numeric")) {
      flog.info("coercing to numeric")
      as.numeric(i)
  } else {
      i
  }
}Turn logging on in the param and set the threshold to WARN.
param <- SnowParam(3, log = TRUE, threshold = "WARN")
bplapply(list(1, "2", integer()), FUN, BPPARAM = param)Lower the threshold to INFO and DEBUG (i.e., use bpthreshold<-) to see how
messages are filtered on severity.
For long running jobs or untested code it can be useful to set a time limit. The timeout field is the time, in seconds, allowed for each worker to complete a task. If a task takes longer than timeout the worker returns an error.
timeout can be set during param construction,
param <- SnowParam(timeout = 20)
param## class: SnowParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 20; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCKor with the setter:
bptimeout(param) <- 2 
param## class: SnowParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCKUse this function to explore different _timeout_s over a numeric vector of ‘X’
values with bplapply(). ‘X’ values less than timeout return successfully
while those over threshold return an error.
fun <- function(i) {
  Sys.sleep(i)
  i
}Distribute files over workers: GenomicFiles::reduceByFile()
The previous counting example used GenomicFiles::reduceByYield() which
operates on a single file and implements a yield, map, reduce paradigm.
In this exercise we’ll use GenomicFiles::reduceByFile() which uses
bplaply() under the hood to operate on multiple files in parallel.
Primary arguments to reduceByFile() are a set of files and a set of ranges.
Files are sent to the workers and data subsets are extracted based on the
ranges. The bulk of the work is done in the MAP function and an optional
REDUCE function combines the output on each worker.
suppressPackageStartupMessages({
    library(BiocParallel)
    library(GenomicFiles)
    library(GenomicAlignments)
    library(Rsamtools)
})On Unix or Mac, configure a MulticoreParam() with 4 workers. Turn on
logging and set a timeout of 60 seconds.
param <- MulticoreParam(4, log = TRUE, timeout = 60)On Windows do the same with SnowParam():
param <- SnowParam(4, log = TRUE, timeout = 60)Point to the collection of bam files.
library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
bfl <- BamFileList(fls)Defining ranges (region of interest) restricts the amount of data on the workers and keeps memory requirements under control.
ranges <- GRanges("chr14", IRanges(c(28477797, 29527797, 32448354),
                                  c(29477797, 30527797, 33448354)))The MAP function reads in records and counts overlaps. readGAlignments()
reads in bam records that overlap with any portion of the ranges defined in the
scanBamParam (i.e., they could be overlapping the start or the end). Once
we’ve got the records in R, we want to count only those that fall ‘within’
the ranges.
MAP <- function(range, file, ...) {
    library(GenomicAlignments)         ## readGAlignments(), ScanBamParam()
    param = ScanBamParam(which=range)  ## restriction
    gal = readGAlignments(file, param=param)
    ## overlaps
    olaps <- findOverlaps(gal, range, type="within", ignore.strand=TRUE)
    tabulate(subjectHits(olaps), subjectLength(olaps))
} Count …
cts <- reduceByFile(ranges, fls, MAP, BPPARAM = param)## ############### LOG OUTPUT ###############
## Task: 1
## Node: 1
## Timestamp: 2019-07-26 13:45:44
## Success: TRUE
## 
## Task duration:
##    user  system elapsed 
##   0.556   0.092   0.718 
## 
## Memory used:
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  8007208 427.7   14721708 786.3  9351876 499.5
## Vcells 16688881 127.4   43009392 328.2 42958239 327.8
## 
## Log messages:
## INFO [2019-07-26 13:45:44] loading futile.logger package
## 
## stderr and stdout:## ############### LOG OUTPUT ###############
## Task: 2
## Node: 2
## Timestamp: 2019-07-26 13:45:45
## Success: TRUE
## 
## Task duration:
##    user  system elapsed 
##   0.584   0.076   0.756 
## 
## Memory used:
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  8007971 427.7   14721708 786.3  9351876 499.5
## Vcells 16690575 127.4   43009392 328.2 42958239 327.8
## 
## Log messages:
## INFO [2019-07-26 13:45:44] loading futile.logger package
## 
## stderr and stdout:## ############### LOG OUTPUT ###############
## Task: 3
## Node: 3
## Timestamp: 2019-07-26 13:45:46
## Success: TRUE
## 
## Task duration:
##    user  system elapsed 
##   0.576   0.096   0.805 
## 
## Memory used:
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  8008004 427.7   14721708 786.3  9351876 499.5
## Vcells 16690653 127.4   43009392 328.2 42958239 327.8
## 
## Log messages:
## INFO [2019-07-26 13:45:44] loading futile.logger package
## 
## stderr and stdout:## ############### LOG OUTPUT ###############
## Task: 4
## Node: 4
## Timestamp: 2019-07-26 13:45:46
## Success: TRUE
## 
## Task duration:
##    user  system elapsed 
##   0.588   0.084   0.813 
## 
## Memory used:
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  8008037 427.7   14721708 786.3  9351876 499.5
## Vcells 16690729 127.4   43009392 328.2 42958239 327.8
## 
## Log messages:
## INFO [2019-07-26 13:45:44] loading futile.logger package
## 
## stderr and stdout:The result is a list the same length as the number of files.
length(cts)## [1] 8Each list element is the length of the number of ranges.
lengths(cts)## ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 
##         3         3         3         3         3         3         3 
## ERR127305 
##         3Tables of counts for each range are extracted with
[[; use simplifyToArray() to get a matrix of counts
cts[[1]]## [[1]]
## [1] 429
## 
## [[2]]
## [1] 22
## 
## [[3]]
## [1] 1338simplify2array(cts)##      ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
## [1,] 429       477       508       452       516       721       713      
## [2,] 22        18        37        24        14        26        30       
## [3,] 1338      1938      1618      1683      1988      2526      2372     
##      ERR127305
## [1,] 544      
## [2,] 20       
## [3,] 1785Lawrence, M, and Morgan, M. 2014. Scalable Genomics with R and Bioconductor. Statistical Science 2014, Vol. 29, No. 2, 214-226. http://arxiv.org/abs/1409.2864v1
BiocParallel: http://bioconductor.org/packages/release/bioc/html/BiocParallel.html
GenomicFiles: http://bioconductor.org/packages/release/bioc/html/GenomicFiles.html