Motivation and relevance
Large data requires computationally efficient approaches to analysis.
Understanding the way R works provides some insight into how we can write efficient code, expanding the range of problems that we can reasonably tackle in R. It is not unusual for efficient R code to be just as correct as naive code but 100x faster.
Parallel evaluation allows us to scale efficient code to exploit computational resources both on our personal computers and in high performance environments, further expanding the scale of analysis reasonably undertaken in R.
logical(), integer(), numeric(), complex(), character(), raw(), list()
data.frame(), matrix()
Attributes
str() (including 2nd argument) and dput()
Exercise: what's a factor()?
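A minimal sketch of how attributes, str(), and dput() relate, using a matrix rather than a factor so the exercise stays open. It assumes the "2nd argument" of str() refers to max.level, the second formal of str.default(); the object names m, lst, and v are illustrative only.

```r
# A matrix is a plain vector plus a 'dim' attribute.
m <- matrix(1:6, nrow = 2)
attributes(m)            # a list with one element, 'dim'
typeof(m)                # the underlying storage type

# str() gives a compact summary; max.level (assumed to be the
# "2nd argument" meant above) limits how deep nested lists are shown.
lst <- list(a = 1:3, b = list(c = "x", d = list(e = TRUE)))
str(lst, max.level = 1)

# dput() writes a text representation that can recreate the object.
dput(m)

# Stripping the attributes leaves the underlying integer vector.
v <- m
attributes(v) <- NULL
identical(v, 1:6)
```

Comparing dput(m) with dput(v) is one way to see exactly which attributes distinguish two objects that print differently.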
environment()
Symbols that get() does not find in the environment are searched for in the parent, iteratively
Exercise: finding variables
Where is pi? After pi <- 3.2, where does your symbol pi reside?
How can you access the pi with a more precise value?
Exercise: NAMESPACE
getNamespace("IRanges") to retrieve the IRanges package name space
ls() to list the content of the namespace
parent.env() to recursively discover the search path for a symbol mentioned in the name space
function
Argument basics
Function environments
<<- ?
Exercise: bank account: explain…
account <- function(initial=0) {
    available <- initial
    list(deposit=function(amount) {
        available <<- available + amount
        available
    }, balance=function() {
         available
    })
}
my_acct <- account()
my_acct$deposit(100)
## [1] 100
your_acct <- account(20)
my_acct$deposit(200)
## [1] 300
my_acct$balance()
## [1] 300
your_acct$balance()
## [1] 20
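One way to investigate the output above is to look at the environments of the returned functions. The sketch below repeats the account() definition so that it is self-contained; env_mine and env_yours are illustrative names, not part of the original exercise.

```r
# Same closure as in the exercise, repeated so the sketch runs on its own.
account <- function(initial = 0) {
    available <- initial
    list(deposit = function(amount) {
        # '<<-' assigns to 'available' in the enclosing environment
        available <<- available + amount
        available
    }, balance = function() {
        available
    })
}

my_acct <- account()
your_acct <- account(20)

# Each call to account() creates a new environment with its own 'available'.
env_mine <- environment(my_acct$deposit)
env_yours <- environment(your_acct$deposit)
identical(env_mine, env_yours)

# deposit() and balance() of one account share a single environment.
identical(env_mine, environment(my_acct$balance))

my_acct$deposit(100)
get("available", envir = env_mine)
get("available", envir = env_yours)
```

The two calls to get() show that a deposit into one account leaves the other account's available untouched.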
Extend the example with my_acct$withdraw.
Write my_bank to manage a number of accounts.
SEXP
type (mentioned further below)
vector
   o length, [, [<-, [[, [[<-, names, names<-, class, class<-, ...
-- raw()                   RAWSXP
-- logical()               LGLSXP
-- numeric()               REALSXP
   -- integer()            INTSXP
-- complex()               CPLXSXP
-- character()             STRSXP
-- list()                  VECSXP
   -- data.frame()
   -- ... many S3 objects
-- structure()
   -- array()
      -- matrix()
-- expression()            EXPRSXP
environment (new.env())    ENVSXP
   o ls
   o [[, [[<-
closure (e.g., function)   CLOSXP
S4 class                   S4SXP
...
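The hierarchy above can be explored from R itself with typeof(), which reports R-level names corresponding to the internal types. This is a rough sketch; the list examples and its element names are chosen here for illustration.

```r
# One small object per row of the hierarchy above.
examples <- list(
    raw = raw(1), logical = logical(1), integer = integer(1),
    numeric = numeric(1), complex = complex(1), character = character(1),
    list = list(), data.frame = data.frame(), matrix = matrix(1:4, 2),
    expression = expression(1 + 2), environment = new.env(),
    closure = function() {}
)

# typeof() returns a single string for each object.
types <- vapply(examples, typeof, character(1))
types
```

Note that numeric() reports "double" (REALSXP), a data.frame reports "list" (VECSXP), and an integer matrix reports "integer": classes and attributes are layered on top of the storage type.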
Some tools
.Internal(inspect())
tracemem()
Exercise: explain…
x <- 1:5; tracemem(x)
x[1] <- 2L
x[1] <- 2
x <- y <- seq(1, 5); tracemem(x)
x[1] <- 2L
df <- data.frame(x=1:5, y=5:1)
tracemem(df); tracemem(df$x)
df[1,1] <- 2
m <- matrix(1:10, 2); tracemem(m)
m[1, 1] <- 2L
f <- function(x) x[1]
g <- function(x) { x[1] <- 2L; x }
tracemem(x <- 1:5); f(x)
tracemem(x <- 1:5); g(x)
x <- 1:5 associates the symbol x with the value 1:5 in a particular (e.g., .GlobalEnv) environment.
x points to a location in memory, where there's a C struct.
The struct is an S-expression (SEXP) atom that represents the data values (1:5) as well as information about the data (e.g., that they are integers, hence INTSXP). We can peek into the S-expression structure with
x <- 1:5
.Internal(inspect(x))
## @1050cc088 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
The output shows where in memory x is located (following the @), that this is an instance of type INTSXP (integer), and that it has length len=5.
Exercise: Use .Internal(inspect()) to discover other common S-expression types, in addition to INTSXP. Some examples:
.Internal(inspect(pi))
.Internal(inspect(data.frame()))
.Internal(inspect(function() {}))
.Internal(inspect(expression(1 + 2)))
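R manages memory for us: there is rarely a need to rm() or garbage collect gc() (many experienced R programmers never use rm()).
Uses NAMED rather than reference counts (R 4.0.0 and later use reference counts, so inspect() there reports REF() in place of NAM())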
.Internal(inspect(1:5))
## @105199ff8 13 INTSXP g0c3 [] (len=5, tl=0) 1,2,3,4,5
.Internal(inspect(x <- 1:5))
## @1050f6838 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
.Internal(inspect(y <- x <- 1:5))
## @1051bc768 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
Copy-on-write illusion
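A small sketch of what the illusion looks like from the user's side: assignment and argument passing behave as if they copied the value, even though R defers the actual copy until a modification occurs. The helper set_first() is hypothetical and used only for illustration.

```r
x <- c(1, 2, 3)
y <- x          # no copy is made yet; x and y share the same memory
y[1] <- 10      # modifying y triggers the copy, so x is untouched
x
y

# Function arguments behave the same way: the caller's object is not changed.
set_first <- function(v) {    # hypothetical helper
    v[1] <- 0
    v
}
z <- set_first(x)
x
z
```

Wrapping the objects in tracemem() before modifying them, as in the exercise above, reports when the deferred copy actually happens.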
Example from Rowe:
'clamp' data so that values are no greater than 5 standard deviations from the mean.
Data:
set.seed(123)
x <- rnorm(10000000)
Find values: Declarative
x[abs(x) > 5 * sd(x)]
## [1] -5.051  5.213  5.348  5.227
Imperative
ans <- numeric()
for (xi in x)
    if (abs(xi) > 5 * sd(x))
        ans <- c(ans, xi)
Clamp: Declarative
x[abs(x) > 5 * sd(x)] <- 5 * sd(x)
Imperative
for (i in seq_along(x))
    if (abs(x[i]) > 5 * sd(x))
        x[i] <- 5 * sd(x)
Question: What are the merits of declarative vs. imperative styles?
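One way to approach the question is to compare the two styles directly. The sketch below is a variation on the example, not the original code: it uses a smaller vector with two artificial outliers, computes the threshold once instead of on every iteration, and preserves the sign of clamped values (the declarative expression above sets every outlier to +5 standard deviations). The function names clamp_loop() and clamp_vec() are hypothetical.

```r
set.seed(123)
# Smaller than the original 1e7 so the loop finishes quickly;
# -10 and 10 are added so that something is actually clamped.
x <- c(rnorm(1e5), -10, 10)

# Imperative, with the threshold hoisted out of the loop.
clamp_loop <- function(x) {
    limit <- 5 * sd(x)
    for (i in seq_along(x))
        if (abs(x[i]) > limit)
            x[i] <- sign(x[i]) * limit
    x
}

# Declarative / vectorized.
clamp_vec <- function(x) {
    limit <- 5 * sd(x)
    pmin(pmax(x, -limit), limit)
}

# Same result, different running time.
identical(clamp_loop(x), clamp_vec(x))
system.time(clamp_loop(x))
system.time(clamp_vec(x))
```

In the original imperative versions sd(x) is re-evaluated on every iteration, so the cost grows quadratically with the length of x; hoisting the threshold or vectorizing removes that cost.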
Exercise: Few R functions are truly functional, but it's possible to recognize 'more' versus 'less' functional ways of writing R code. For the following, which are 'more' and which 'less' functional?
df <- data.frame(x=1:5, y=5:1)
x0 <- sapply(names(df), function(x) sqrt(df[[x]]))
x1 <- sapply(names(df), function(x, df) sqrt(df[[x]]), df)
x3 <- sapply(df, function(x, fun) fun(x), sqrt)
x2 <- sapply(df, sqrt)
Is an assignment such as df$x <- 5:1 consistent with functional programming?