DelayedDataFrame 1.0.0
As the genetic/genomic data are having increasingly larger profile,
the annotation file are also getting much bigger than expected. the
memory space in R has been an obstable for fast and efficient data
processing, because most available R or Bioconductor packages are
developed based on in-memory data manipulation. With some newly
developed data structure as HDF5 or GDS, and the R interface
of DelayedArray to represent on-disk data structures with
different back-end in R-user-friendly array data structure (e.g.,
HDF5Array,GDSArray), the high-throughput genetic/genomic data
are now being able to easily loaded and manipulated within
R. However, the annotation files for the samples and features inside
the high-through data are also getting unexpectedly larger than
before. With an ordinary data.frame or DataFrame, it is still
getting more and more challenging for any analysis to be done within
R. So here we have developed the DelayedDataFrame, which has the
very similar characteristics as data.frame and DataFrame. But at
the same time, all column data could be optionally saved on-disk
(e.g., in DelayedArray structure with any back-end). Common
operations like constructing, subsetting, splitting, combining could
be done in the same way as DataFrame. This feature of
DelayedDataFrame could enable efficient on-disk reading and
processing of the large-scale annotation files, and at the same,
signicantly saves memory space with common DataFrame metaphor in R
and Bioconductor.
Download the package from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DelayedDataFrame")
The development version is also available to download through github:
BiocManager::install("Bioconductor/DelayedDataFrame")
Load the package into R session before using:
library(DelayedDataFrame)
DelayedDataFrame extends the DataFrame data structure, with an
additional slot called lazyIndex, which saves all the mapping
indexes for each column of the data inside DelayedDataFrame. It is
similar to data.frame in terms of construction, subsetting,
splitting, combining… The rownames are having same feature as
DataFrame. It will not be given automatically, but only by
explicitly specify in the constructor function DelayedDataFrame(, row.names=...) or using the slot setter function rownames()<-.
Here we use the GDSArray data as example to show the
DelayedDataFrame characteristics. GDSArray is a Bioconductor
package that represents GDS files as objects derived from the
DelayedArray package and DelayedArray class. It carries the
on-disk data path and represent the GDS nodes in a
DelayedArray-derived data structure.
The GDSArray() constructor takes 2 arguments: the file path and the
GDS node name inside the GDS file.
library(GDSArray)
## Loading required package: gdsfmt
##
## Attaching package: 'GDSArray'
## The following object is masked from 'package:utils':
##
## example
file <- SeqArray::seqExampleFileName("gds")
gdsnodes(file)
## [1] "description" "sample.id"
## [3] "variant.id" "position"
## [5] "chromosome" "allele"
## [7] "genotype/data" "genotype/~data"
## [9] "genotype/extra.index" "genotype/extra"
## [11] "phase/data" "phase/~data"
## [13] "phase/extra.index" "phase/extra"
## [15] "annotation/id" "annotation/qual"
## [17] "annotation/filter" "annotation/info/AA"
## [19] "annotation/info/AC" "annotation/info/AN"
## [21] "annotation/info/DP" "annotation/info/HM2"
## [23] "annotation/info/HM3" "annotation/info/OR"
## [25] "annotation/info/GP" "annotation/info/BN"
## [27] "annotation/format/DP/data" "annotation/format/DP/~data"
## [29] "sample.annotation/family"
varid <- GDSArray(file, "annotation/id")
AA <- GDSArray(file, "annotation/info/AA")
We use an ordinary character vector and the GDSArray objects to
construct a DelayedDataFrame object.
ddf <- DelayedDataFrame(varid, AA)
The slots of DelayedDataFrame could be accessed by lazyIndex(),
nrow(), rownames() (if not NULL) functions. With a newly
constructed DelayedDataFrame object, the initial value of
lazyIndex slot will be NULL for all columns.
lazyIndex(ddf)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
nrow(ddf)
## [1] 1348
rownames(ddf)
## NULL
lazyIndex slotThe lazyIndex slot is in LazyIndex class, which is defined in the
DelayedDataFrame package and extends the SimpleList class. The
listData slot saves unique indexes for all the columns, and the
index slots saves the position of index in listData slot for each
column in DelayedDataFrame object. In the above example, with an
initial construction of DelayedDataFrame object, the index for each
column will all be NULL, and all 3 columns points the NULL values
which sits in the first position in listData slot of lazyIndex.
lazyIndex(ddf)@listData
## [[1]]
## NULL
lazyIndex(ddf)@index
## [1] 1 1
Whenever an operation is done (e.g., subsetting), the listData slot
inside the DelayedDataFrame stays the same, but the lazyIndex slot
will be updated, so that the show method, further statistical
calculation will be applied to the subsetting data set. For example,
here we subset the DelayedDataFrame object ddf to keep only the
first 5 rows, and see how the lazyIndex works. As shown in below,
after subsetting, the listData slot in ddf1 stays the same as
ddf. But the subsetting operation was recorded in the lazyIndex
slot, and the slots of lazyIndex, nrows and rownames (if not
NULL) are all updated. So the subsetting operation is kind of
delayed.
ddf1 <- ddf[1:20,]
identical(ddf@listData, ddf1@listData)
## [1] TRUE
lazyIndex(ddf1)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## index of each column:
## [1] 1 1
nrow(ddf1)
## [1] 20
Only when functions like DataFrame(), or as.list(), the
lazyIndex will be realized and DelayedDataFrame returned.
We will show the realization in the following coercion method section.
The common methods on data.frame or DataFrame are also defined on
DelayedDataFrame class, so that they behave similarily on
DelayedDataFrame objects.
Coercion methods between DelayedDataFrame and other data structures
are defined. When coercing from ANY to DelayedDataFrame, the
lazyIndex slot will be added automatically, with the initial NULL
value of indexes for each column.
as(letters, "DelayedDataFrame")
## DelayedDataFrame with 26 rows and 1 column
## X
## <character>
## 1 a
## 2 b
## 3 c
## ... ...
## 24 x
## 25 y
## 26 z
as(DataFrame(letters), "DelayedDataFrame")
## DelayedDataFrame with 26 rows and 1 column
## letters
## <character>
## 1 a
## 2 b
## 3 c
## ... ...
## 24 x
## 25 y
## 26 z
(a <- as(list(a=1:5, b=6:10), "DelayedDataFrame"))
## DelayedDataFrame with 5 rows and 2 columns
## a b
## <integer> <integer>
## 1 1 6
## 2 2 7
## 3 3 8
## 4 4 9
## 5 5 10
lazyIndex(a)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
When coerce DelayedDataFrame into other data structure, the
lazyIndex slot will be realized and the new data structure
returned. For example, when DelayedDataFrame is coerced into a
DataFrame object, the listData slot will be updated according to
the lazyIndex slot.
df1 <- as(ddf1, "DataFrame")
df1@listData
## $varid
## <20> DelayedArray object of type "character":
## 1 2 3 . 19
## "rs111751804" "rs114390380" "rs1320571" . "rs61751002"
## 20
## "rs6691840"
##
## $AA
## <20> DelayedArray object of type "character":
## 1 2 3 . 19 20
## "T" "G" "A" . "C" "C"
dim(df1)
## [1] 20 2
[two-dimensional [ subsetting on DelayedDataFrame objects by
integer, character, logical values all work.
ddf[, 1, drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## varid
## <GDSArray>
## 1 rs111751804
## 2 rs114390380
## 3 rs1320571
## ... ...
## 1346 rs8135982
## 1347 rs116581756
## 1348 rs5771206
ddf[, "AA", drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## AA
## <GDSArray>
## 1 T
## 2 G
## 3 A
## ... ...
## 1346 C
## 1347 G
## 1348 G
ddf[, c(TRUE,FALSE), drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## varid
## <GDSArray>
## 1 rs111751804
## 2 rs114390380
## 3 rs1320571
## ... ...
## 1346 rs8135982
## 1347 rs116581756
## 1348 rs5771206
When subsetting using [ on an already subsetted DelayedDataFrame
object, the lazyIndex, nrows and rownames(if not NULL) slot will
be updated.
(a <- ddf1[1:10, 2, drop=FALSE])
## DelayedDataFrame with 10 rows and 1 column
## AA
## <DelayedArray>
## 1 T
## 2 G
## 3 A
## ... ...
## 8 C
## 9 G
## 10 G
lazyIndex(a)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## index of each column:
## [1] 1
nrow(a)
## [1] 10
[[The [[ subsetting will take column subscripts for integer or
character values, and return corresponding columns in it’s original
data format.
ddf[[1]]
## <1348> GDSArray object of type "character":
## 1 2 3 . 1347
## "rs111751804" "rs114390380" "rs1320571" . "rs116581756"
## 1348
## "rs5771206"
ddf[["varid"]]
## <1348> GDSArray object of type "character":
## 1 2 3 . 1347
## "rs111751804" "rs114390380" "rs1320571" . "rs116581756"
## 1348
## "rs5771206"
identical(ddf[[1]], ddf[["varid"]])
## [1] TRUE
rbind/cbindWhen doing rbind, the lazyIndex of input arguments will be
realized and a new DelayedDataFrame with NULL lazyIndex will be
returned.
ddf2 <- ddf[21:40, ]
(ddfrb <- rbind(ddf1, ddf2))
## DelayedDataFrame with 40 rows and 2 columns
## varid AA
## <DelayedArray> <DelayedArray>
## 1 rs111751804 T
## 2 rs114390380 G
## 3 rs1320571 A
## ... ... ...
## 38 rs1886116 C
## 39 rs115917561 G
## 40 rs61751016 T
lazyIndex(ddfrb)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
cbind of DelayedDataFrame objects will keep all existing
lazyIndex of input arguments and carry into the new
DelayedDataFrame object.
(ddfcb <- cbind(varid = ddf1[,1, drop=FALSE], AA=ddf1[, 2, drop=FALSE]))
## DelayedDataFrame with 20 rows and 2 columns
## varid AA.AA
## <DelayedArray> <DelayedArray>
## 1 rs111751804 T
## 2 rs114390380 G
## 3 rs1320571 A
## ... ... ...
## 18 rs115614983 T
## 19 rs61751002 C
## 20 rs6691840 C
lazyIndex(ddfcb)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## index of each column:
## [1] 1 1
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GDSArray_1.4.0 gdsfmt_1.20.0 DelayedDataFrame_1.0.0
## [4] DelayedArray_0.10.0 BiocParallel_1.18.0 IRanges_2.18.0
## [7] matrixStats_0.54.0 S4Vectors_0.22.0 BiocGenerics_0.30.0
## [10] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 XVector_0.24.0 knitr_1.22
## [4] magrittr_1.5 zlibbioc_1.30.0 GenomicRanges_1.36.0
## [7] lattice_0.20-38 GenomeInfoDb_1.20.0 stringr_1.4.0
## [10] tools_3.6.0 grid_3.6.0 xfun_0.6
## [13] SeqArray_1.24.0 htmltools_0.3.6 yaml_2.2.0
## [16] digest_0.6.18 SNPRelate_1.18.0 bookdown_0.9
## [19] Matrix_1.2-17 GenomeInfoDbData_1.2.1 BiocManager_1.30.4
## [22] bitops_1.0-6 RCurl_1.95-4.12 evaluate_0.13
## [25] rmarkdown_1.12 stringi_1.4.3 compiler_3.6.0
## [28] Biostrings_2.52.0